MASSACHUSETTS INSTITUTE OF TECHNOLOGY 
ARTIFICIAL INTELLIGENCE LABORATORY


A.I. Technical Report No. 1224


February, 1990 


Fat-Tree Routing for Transit


Andre DeHon 

andre@ai.mit.edu


Abstract: As an alternative to using a bidelta network topology for large Transit networks,
I consider the requirements to extend the base Transit network into Leiserson’s
Fat-Tree configuration. Transit will be a high-speed, low-latency, fault-tolerant network
interconnection for high performance multi-processor computers. The initial interconnect
scheme planned for Transit will use a bidelta style network to support up to 256 processors.
Scaling beyond 256 processors by simply extending that network topology will result in a
uniform degradation of network latency across processors. A fat-tree network structure will
allow the Transit network to be scaled arbitrarily while taking advantage of the locality
and universality of fat-trees to minimize the impact of scaling upon network latency. I
consider the topology and construction issues for integrating the Transit routing network
component and technology into a fat-tree configuration. I also characterize the resulting
network’s size, locality, and performance and compare these characteristics with those of
bidelta networks.
1. Introduction


1.1 Overview

Transit [Knight 89] is a high-performance network for large scale MIMD computers.
Transit is intended to provide the underlying network support for a wide range of parallel
processing paradigms including message passing, shared memory, and dataflow. Transit
provides low-latency interconnect between processors via an indirect circuit switched
network. The network is constructed as a bidelta multi-stage shuffle exchange network
[Kruskal 86] utilizing 4x4 crossbar routing elements. The Transit bidelta network allows
connections between source and destination to be made through a few routing stages for
moderate size networks. In order to provide fault-tolerance and improve network routing
efficiency, redundant paths are provided through the network. Routing switches are kept
simple, and thus fast, by implementing a source responsible connection protocol; this frees
the individual routing elements from the complexity of collision avoidance.

For moderate sized networks (i.e. 64 to 256 processors), the bidelta network construction
keeps the delay uniformly small between all processors in the network. Additionally,
using three-dimensional wiring strategies, the size and wire length can be kept reasonably
small. This allows the network to be constructed in a single physical package within current
technology limitations.

Scaling to larger network sizes by simply extending this strategy leads to a few difficulties.
Network delays become uniformly large. The network itself becomes large enough that
it must be divided into multiple parts for packaging. In addition, wire lengths grow such
that reducing the clock rate of the system, due to wire delays, becomes a serious concern.

To avoid this uniform performance degradation, I propose a scheme for constructing
larger networks using Leiserson’s volume-universal fat-tree network [Leiserson 85]. In this
manner, locality can be exploited to provide short interconnect between closely situated
processors; at the same time, interconnection between more distant processors is still
possible without significant performance reduction from that of the bidelta network. The fat-tree
network also allows the network to be decomposed into constituent elements for packaging
in a straightforward manner.

The extension to a fat-tree network scheme primarily impacts interconnection topology
and routing schemes. I describe how the fat-tree routing network can be realized utilizing
the existing Transit routing element and packaging technology. This extension can be
implemented with minor revisions to the Transit routing component.

In the remainder of this chapter, I briefly review the relevant properties of Transit
(Sections 1.2 through 1.5) and fat-trees (Section 1.6). In Chapter 2, I develop some of the
topology issues for building the fat-tree network structure. Chapter 3 follows, providing
concrete possibilities for the realization of the fat-tree network. Chapter 4 then proceeds to
quantify some of the properties of the network and compares it with Transit bidelta networks
of comparable size. Finally, Chapter 5 closes with conclusions and future directions.

Figure 1.1: Network Processor Interface

1.2 Processor Network Interface Model

The development provided throughout this paper is concerned entirely with the
construction of interconnection networks. In order for the network to be useful in the context
of a large scale parallel computer, it must interface coherently with its set of processors.
The network processor interface is shown in Figure 1.1. Each network endpoint is a
processor with its own local memory and a cache-controller for maintaining its local cache
and keeping the cache coherent with the rest of the network. Each processor has a pair of
inputs and a pair of outputs to the network. The Transit design intends the cache-controller to
serve as the processor’s interface to the network so that the processor need not explicitly
deal with network interactions; however, this is a separate architectural issue and need not
necessarily be the case.

The network input and output connections are paired for fault tolerance and improved
routing success probabilities. Fault tolerance issues are discussed further in Section 1.
The processor will generally only use one of its two inputs to the network, guaranteeing
that the network will never be loaded above 50%.

1.3 Routing Component

Basic switching is provided by the routing element, RN1. This is a custom CMOS
routing component designed to provide simple high speed switching. RN1 has eight nine-bit
wide input channels and eight nine-bit wide output channels. This provides byte wide data
transfer with the ninth bit serving as a signal for the beginning and end of transmissions.
Figure 1.2: RN1 Logical Configurations


RN1 can be configured in one of two ways as shown in Figure 1.2. RN1’s primary
configuration is as a 4x4 crossbar router with 2 equivalent outputs in each logical direction.
In this configuration, all 8 input channels are logically equivalent. Alternately, RN1 can be
configured as a pair of 4x4 crossbars, each with 4 logically equivalent inputs and a single
output in each logical direction.

Simple routing is performed by using the first two bits of a transmission to indicate
the desired output destination. If an output in the desired direction is available, the data
transmission is routed to one such output. Otherwise, the data is ignored. In either case,
when the transmission completes, RN1 informs the sender of the connection status so that
the sender will know whether or not it is necessary to retry the transmission.

To allow rapid responses to network requests, RN1 allows connections opened over the
network to be turned around; that is, the direction of the connection can be reversed,
allowing data to flow back from the destination to the source processor. The ability to turn
a network connection around allows a processor requesting data to get its response quickly
without requiring the processor it is communicating with to open a separate connection
through the network.

Since RN1 always looks at the most significant two bits of the first byte of a transmission
to determine switching direction, it is necessary for the bits within this byte to be rotated
between network stages. This rotation guarantees each network stage a fresh set of bits
from the routing byte. The rotation property must be provided by the network wiring
scheme in which RN1 is used.
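
The rotation requirement can be illustrated with a minimal sketch. It assumes, for
illustration only, that the inter-stage wiring rotates the routing byte left by two bits; the
text states that a rotation is required but not its direction, and the function names are
hypothetical.

```python
def direction(routing_byte):
    """Logical direction (0-3) selected by the two most significant bits."""
    return (routing_byte >> 6) & 0b11


def rotate_for_next_stage(routing_byte):
    """Rotate the byte left by two bits (assumed rotation direction)."""
    return ((routing_byte << 2) | (routing_byte >> 6)) & 0xFF


def directions_for_stages(routing_byte, stages=4):
    """Directions seen by successive stages that share one routing byte."""
    seen = []
    for _ in range(stages):
        seen.append(direction(routing_byte))
        routing_byte = rotate_for_next_stage(routing_byte)
    return seen
```

Under this assumption one routing byte supplies fresh direction bits to four successive
stages, after which the byte has returned to its original value.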

When configured as a 4x4 crossbar with two outputs in each logical direction, RN1
provides redundant paths through the network since it provides multiple outputs in each
logical direction. This serves to increase both fault tolerance and the probability of routing
success. When both logically equivalent outputs are available, RN1 randomly selects one of
the two for use. In this manner, networks built from RN1 will have a level of fault tolerance
in that multiple attempts to send a transmission through the network will most likely take
separate paths through the network. The random selection of the preferred output port
in each direction gives transmissions a good chance of avoiding any faulty component(s)
entirely in successive connection attempts.
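
The output selection just described can be sketched as follows. This is an illustrative
model, not the RN1 logic itself: the port numbering (ports 2d and 2d+1 serve direction d)
and the function name are assumptions.

```python
import random


def select_output(direction, busy, rng=random):
    """Pick a free output port for `direction`, or None if the request is blocked.

    Ports 2*direction and 2*direction + 1 are treated as the two logically
    equivalent outputs for a direction (an assumed numbering).
    """
    candidates = [p for p in (2 * direction, 2 * direction + 1) if p not in busy]
    if not candidates:
        return None  # data is ignored; the sender sees the failure and may retry
    return rng.choice(candidates)  # random choice when both outputs are free
```

Because the choice is random whenever both outputs are free, a retried connection need not
follow the path taken by the failed attempt.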

The alternative configuration for RN1 as a pair of 4x4 crossbars is provided for further
fault tolerance. As Section 1.2 described, there are two outputs from the network for
each processor. Using only the standard RN1 configuration, these two outputs would
have to come from a single routing component, making that routing component critical to
the proper functioning of the network [Leighton 89-2]. With this alternate configuration,
the two outputs from the network for each processor can come from two different RN1
components so neither component is critical. Section 1.5 explains this fault tolerance issue
in the context of the Transit bidelta networks.

Figure 1.3: RN1 Package (Diagram courtesy of Fred Drenckhahn)

The Transit routing component is described further in [Knight 89] and [Minsky 90].


1.3.1 Physical Description

RN1 will be packaged as a pad grid array with all signals appearing on pads on both
sides of the package. Figure 1.3 shows a top and side view of the RN1 package. A total of
76 periphery pads will provide through routing conduits for signals. Holes in the package
are provided for packaging alignment and coolant flow. Since RN1 is designed to drive
144 signal pins simultaneously at 100MHz, provisions for liquid cooling are essential. The
target physical statistics for RN1 are summarized in Table 1.1.

1.3.2 Routing to more than 256 destinations

Using the first byte to specify routing destinations as described above becomes limiting
when attempting to address a large number of destinations. A single byte can only
distinguish 256 distinct destinations.

To address this potential problem, RN1 has a provision for dropping the leading byte
of a stream of data before looking at it. It then uses the remainder of the stream as if the
first byte never existed. Thus, RN1 starts using the new routing byte, which was originally
the second byte in the message, at the stage where the first was dropped. By properly
designating the stages in the network at which components should drop, or swallow, the
first byte in this manner, we can specify arbitrarily many distinct destinations.

This swallow property of a network routing stage is a static property. In the simple case
of bidelta networks, the swallow property can be set on a per chip basis, since all inputs
to a network stage will need fresh routing bytes at the same time. For full generality in
the networks considered here, it is beneficial to be able to configure this property on a per
input basis. This allows a single routing component to have inputs that connect to paths
of varying lengths.
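
A minimal sketch of how the swallow property extends addressing along one path is given
below. The per-stage flag list stands in for the static per-input configuration encountered
along that path; the left rotation by two bits between stages and all names are assumptions
made for illustration.

```python
def route_directions(message, swallow_flags):
    """Return the direction (0-3) chosen at each stage along one path.

    message       -- list of byte values; routing bytes come first
    swallow_flags -- one bool per stage; True means the stage drops the
                     leading byte before examining the stream
    """
    stream = list(message)
    chosen = []
    for swallow in swallow_flags:
        if swallow:
            stream = stream[1:]      # discard the exhausted routing byte
        head = stream[0]
        chosen.append(head >> 6)     # top two bits select the direction
        # Inter-stage wiring rotates the head byte (left by two bits, assumed).
        stream[0] = ((head << 2) | (head >> 6)) & 0xFF
    return chosen
```

For example, with four stages consuming the first routing byte and the fifth stage set to
swallow, the fifth and sixth stages take their directions from the second byte.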

1.4 Construction Technology

The basic unit of network packaging for Transit is the stack. A stack is a three-dimensional
interconnect structure constructed by sandwiching layers of RN1 routing components
between horizontal pc-board layers. The pc-boards perform inter-stage wiring and
the bit rotations described in the previous section while the routing stages provide switching.
Figure 1.4 shows a partial cross-section of a stack. The dominant direction of signal
flow is vertical as connections are made vertically through the stack. At each horizontal
routing layer, each path through the network will make a connection through a single
routing component. Between routing layers, the connection is routed horizontally to the
appropriate routing component in the next layer.

When the transmission reaches the top of the routing stack it is brought straight down,
back through the stack, to connect to the destination processors. This is necessary because
the set of source and destination processors will normally be the same. All routing through
the layers of routing components is provided by the through routing pins on the RN1
package as described in the previous section.

Figure 1.4: Cross-Section of Bidelta Routing Stack (Diagram courtesy of Fred Drenckhahn)

Contact is made between the routing components and the horizontal pc-boards through
button board carriers. These carriers are thin boards, roughly the same size as the routing
chip, with button balls [Smolley 85] aligned to each pad on the routing chip. These button
balls are 25 micron spun wire compressed into 20 mil diameter by 40 mil high holes in the
button board connector. They provide multiple points of contact between each routing
component and horizontal board when the stack is compressed together; in this manner
they effect good electrical contact without the need for solder. This allows ease of packaging
construction and component replacement.

Channels are provided both in the stack and through each routing component for liquid
cooling. FC-77 Fluorinert will be pumped through these channels to provide efficient heat
removal during operation.

At the targeted clock rate of 100MHz for network operation, wire delay consumes a
significant portion of the clock cycle. Thus, the physical size of the horizontal routing
boards is an important consideration for network performance. Additionally, with current
technology for fabricating pc-boards, it is not possible to fabricate pc-boards any larger than
2' X 2' with reliable yield.

Using 12 layer pc-boards for the horizontal routing, a stack is expected to have a side
length of roughly: 2 X (chip side length) X (number of chips across side). Each layer,
including pc boards, routing components, and connectors, will add [...] to the stack height.
The height of a stack will be roughly: [...] X (number of routing layers).

A more detailed description of Transit packaging technology is given in [Knight 89].

1.5 Fault Tolerance in Bidelta Routing Stacks

The network interface model described in Section 1.2 along with the properties designed
into RN1 allow a reasonable level of fault tolerance to be built into Transit bidelta networks.

Two inputs are provided from the processor or cache-controller to the network. The
processor is expected to utilize only one of these two inputs at a given point of time. Leaving
the second input unused guarantees that the network will never be congested above the 50%
level (Section 1.2). Allowing the processor to choose which of the two inputs to the network
to actually use at the beginning of each network transaction prevents a single link between
the processor and network from being critical to the processor’s ability to communicate
over the network. When the two inputs are provided to the network through separate routing
components, we guarantee that no single component failure will isolate a set of processors
from the network.

Within the network, the standard RN1 configuration provides multiple outputs in each
logical direction. This allows the number of different paths through the network to expand.
No single routing component in the interior of the bidelta network is critical since there
is a complete path from each source to each destination which avoids any given routing
component.

The final routing stage of the bidelta network is constructed with the RN1 routing
components configured in the alternative manner in which RN1 acts as a pair of single
output 4x4 crossbars. In this way, the final switching stage for the pair of outputs
destined to the same processor can be distributed across two different routing components.
Since any connection to a given processor could be routed through either output, and hence
either of the two routing components in this final stage, neither of the routing components
is critical.

A more detailed description of this scheme for achieving fault tolerance in multistage
networks is given in [DeHon 90].

1.6 Fat-Trees

Fat-Trees are universal routing networks for interconnecting processors in a
multiprocessor environment [Leiserson 85]. Fat-Trees interconnect processors using a complete tree
structure. The processors are located at the leaves of the tree while the tree’s internal nodes
compose the interconnection network. Fat-Trees parameterize the bandwidth between
internal nodes according to their distance from the root node. In general, nodes closer to
the root have greater bandwidth to accommodate additional message traffic. Connections
between processors within a sub-tree can be made without consuming bandwidth higher
in the tree. This allows locality to be exploited while reserving critical “long distance”
bandwidth for connections between widely separated processors. A fat-tree achieves these
properties with only a linear increase in the number of routing stages that must be traversed
in the worst case over a comparably sized bidelta network.

Figure 1.5: Area-Universal Fat-Tree with Constant Size Switches (Greenberg and Leiserson)

Fat-Trees are of particular interest because they can be area- or volume-universal when
the bandwidth growth toward the root is selected properly. A fat-tree is volume-universal
if it can simulate any other network constructed from the same amount of hardware with at
most a polylogarithmic slowdown. This property guarantees that the hardware dedicated
to constructing the routing network can be utilized efficiently.

In [Greenberg 85] Greenberg and Leiserson propose the area-universal fat-tree shown
in Figure 1.5. The fat-tree shown demonstrates a basic construction structure for fat-trees
using constant size switching elements. The internal node capacity doubles on each
successive level toward the root. This average capacity growth is sufficient to guarantee
area-universality for a quaternary fat-tree.
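
The effect of doubling capacity toward the root can be tabulated with a short sketch. It
assumes a complete quaternary tree in which every node at a level has the same capacity
and the doubling applies exactly at each level; the function name and the unit leaf
capacity are illustrative choices, not values from the references.

```python
def level_profile(height, leaf_capacity=1):
    """Per-level (nodes, capacity per node, aggregate capacity), levels 0..height.

    Level 0 is the leaf level; level `height` is the root.
    """
    profile = []
    for level in range(height + 1):
        nodes = 4 ** (height - level)           # complete quaternary tree
        capacity = leaf_capacity * 2 ** level   # doubles toward the root
        profile.append((nodes, capacity, nodes * capacity))
    return profile
```

Under these assumptions the aggregate capacity halves at each level toward the root,
because the node count falls by a factor of four while the per-node capacity only doubles.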

Fat-Tree networks were developed by Charles Leiserson. He provides a detailed
description in [Leiserson 85] as well as proofs for the universality properties of fat-trees.
In [Greenberg 85], Leiserson and Greenberg expand the generality of fat-trees by proving that
the universality properties hold for on-line routing algorithms.

2. Basic Configuration


2.1 Hybrid Fat-Tree Approach

The fat-tree network structure will be used to interconnect small Transit bidelta routing
networks. That is, the leaf nodes of the fat-tree will themselves be bidelta networks rather
than individual processors. This allows a moderate number of processors to be clustered
together and share uniformly short interconnection paths amongst themselves. All these
bidelta clusters are then interconnected through the fat-tree network, allowing locality to
be exploited between adjacent clusters while making it possible for any pair of processors to
communicate in a moderately efficient manner.

The terminal clusters are built in stacks much like the standard Transit bidelta network
stacks. The only difference is that a bidelta cluster stack must have some bandwidth
between itself and the fat-tree network rather than dedicating all bandwidth in and out
of the stack to processors. This can be done in a straightforward manner as shown
diagrammatically in Figure 2.1. The processors connect to the bidelta network, but instead of
consuming all of the bandwidth into the first stage of the bidelta network, they consume
only three-fourths of the bandwidth. The first routing stage routes in four logical directions
as before. However, only three of these directions route to further switching stages in the
bidelta cluster. One of the logical destinations out of the first routing stage connects directly
to the fat-tree network. Thus, the remaining routing stages will only be three-quarters the
size of the first routing stage since one-quarter of the interconnect bandwidth leaves the
stack after the first routing stage.
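
The bandwidth split described above can be worked through numerically. The sketch below
assumes eight inputs and eight outputs per RN1 (two outputs per logical direction) and two
network inputs per processor, as described in Chapter 1; the function name and the
16-chip example are hypothetical.

```python
def cluster_budget(first_stage_chips):
    """Split first-stage bandwidth between processors and the fat-tree."""
    if first_stage_chips % 4:
        raise ValueError("first-stage chip count must be a multiple of four")
    total_inputs = 8 * first_stage_chips
    from_processors = 3 * total_inputs // 4   # three-fourths of input bandwidth
    return {
        "processors": from_processors // 2,   # two network inputs per processor
        "inputs_from_fat_tree": total_inputs - from_processors,
        "outputs_to_fat_tree": 2 * first_stage_chips,  # one of four directions
        "later_stage_chips": 3 * first_stage_chips // 4,
    }
```

For a hypothetical first stage of 16 routing components this gives 48 processors, 32 inputs
from and 32 outputs to the fat-tree, and 12 components in each later stage.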



Figure 2.1: Bidelta Cluster at Leaves of Fat-Tree

2.1.1 Allocation of Bandwidth to and from Fat-Tree


In the manner described, the fat-tree network consumes one-quarter of the total bandwidth
into and out of the bidelta routing stack. Clearly, the one-quarter of the bandwidth
from the first stage of routing to the fat-tree network must come from all the routing
components in the first stage. Naively, it would be possible to connect the bandwidth back from
the fat-tree network to the bidelta cluster to any of the inputs of the first stage routing
components as all these inputs are logically equivalent; however, it turns out that many
possible configurations prove not to be optimal.

One extreme would be to connect all the input to the bidelta cluster from the fat-tree
network to all of the inputs of one-quarter of the routing elements in the first stage. This
would be non-optimal in terms of fault-tolerance because an unfortunately placed single
chip failure in this first routing stage could eliminate 8 of the inputs from the fat-tree
network. In terms of routing, this would be wasteful of bandwidth to the fat-tree network.
The outputs from the first routing stage to the fat-tree network that come from the set of
routing components with all their inputs originating from the fat-tree network would be
unused. This results because there is no reason to route a connection through a leaf node,
that is, in from the fat-tree and directly back out to it. Clearly, this extreme is undesirable.

The alternate extreme is to distribute the bandwidth from the fat-tree network to the
cluster over all the routing components in the first routing stage. In this scheme, each
routing chip in the first stage has 2 of its 8 inputs connected to the fat-tree network. A
single chip failure will always reduce the bandwidth from the fat-tree network to the bidelta
cluster by two. Through each routing component at most six of the inputs will need to
be switched in the direction of the fat-tree network. Given even connection frequency
distributions across all processors of requests requiring the fat-tree network, each pair of
outputs to the fat-tree will be equally loaded.

A brief consideration of alternatives between these two extremes makes it clear that
the latter extreme is the most equitable distribution. Any configuration between these
two extremes would make it more likely for some processors to be able to make a
non-local connection than others. Similarly, any scheme other than the latter extreme would
necessarily lose more bandwidth due to a worst-case single chip failure. Thus, the most
equitable configuration is the one in which all the bandwidth between the fat-tree network
and the bidelta cluster is evenly distributed over all the routing components composing the
first stage of routing in the bidelta cluster.
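
The fault-tolerance comparison between the two extremes can be checked with a small
sketch. The list-of-counts representation and the function names are illustrative
assumptions; the figures of 8 and 2 fat-tree inputs per chip come from the discussion above.

```python
def worst_case_loss(tree_inputs_per_chip):
    """Most fat-tree inputs that a single first-stage chip failure can remove."""
    return max(tree_inputs_per_chip)


def concentrated(chips):
    """All fat-tree inputs placed on one quarter of the chips (8 each)."""
    loaded = chips // 4
    return [8] * loaded + [0] * (chips - loaded)


def distributed(chips):
    """Two fat-tree inputs placed on every chip."""
    return [2] * chips
```

Both layouts carry the same total bandwidth from the fat-tree, but the distributed layout
bounds the loss from any single chip failure at two inputs rather than eight.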


2.1.2 Fault Tolerance

The bidelta leaf cluster is configured for fault tolerance in the same manner as Transit
bidelta networks (Section 1.5). The pair of inputs from each processor should be
distributed to different routing components. All but the final stage of routing is provided by
RN1 components in the standard 4x4 crossbar configuration with 2 outputs in each logical
direction. The final output stage is constructed with the RN1 components configured in
the alternate manner such that each processor receives its pair of network outputs from
two distinct routing components.

Figure 2.2: Logical Topology of Quaternary Tree


2.2 Fat-Tree Topology
2.2.1 Quaternary Tree

The appropriate fat-tree to construct using RN1 is logically a quaternary fat-tree
(Figure 2.2) as opposed to the binary fat-trees prevalent in the literature. This is the natural
selection since our routing component is a 4x4 crossbar switch. Using fewer than four
directions would require overspecifying every routing path through the network since two logical
output ports from a routing chip would actually be going in the same logical direction.
Similarly, constructing a tree with a branching factor greater than four would require
multiple stages of routing components to construct each virtual routing stage. The quaternary
tree, of course, has fewer levels of routing for a given network size than a comparable binary
tree.


2.2.2 Separate Up and Down Trees

The logical structure of a fat-tree as described in [Leiserson 85], [Greenberg 85], and
elsewhere integrates up and down routing into a single tree. For construction purposes based
on the Transit routing components and technology, it is preferable to actually construct
this logical structure with the up and down routing trees separated. Lateral connections
are provided between the two routing trees at every level to allow connections to cross over
from the up routing tree to the down routing tree as soon as appropriate.
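
The level at which a connection can cross from the up tree to the down tree is the lowest
level whose subtree contains both endpoints. A minimal sketch, assuming leaves (here,
bidelta clusters) are numbered so that the four children of each node are consecutive;
the numbering and the function name are illustrative assumptions:

```python
def crossover_level(src, dst):
    """Lowest tree level whose subtree contains both leaves (0 means same leaf)."""
    level = 0
    while src != dst:
        src //= 4   # move each endpoint to its parent in a quaternary tree
        dst //= 4
        level += 1
    return level
```

Leaves that share a parent cross over at level 1, so nearby clusters never consume
bandwidth higher in the tree.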


2.2.3 Structure

The fat-tree structure will be constructed in three-dimensional space. A separate plane
is allocated for each tree level. One way to view this is to take the logical structure shown
in Figure 2.2 and raise each internal tree node up to a plane equivalent to its height in the
tree. Figure 2.3 shows this three-dimensional mapping of the tree structure.

Figure 2.3: Three-Dimensional View of Tree Structure


2.2.4 Upward Routing

RN1 is a crossbar switching element. Since RN1 has multiple outputs in each logical
direction, it does some amount of concentration. Its primary strength, though, is in routing.
We can take advantage of this by routing in the up tree as well as the down tree. Rather
than actually going through every logical tree level on the journey up the tree to the desired
height, some routing is done to allow the up routing path to short-cut around some tree
stages. With this short-cutting, we don’t need an up routing stage for every level of the tree.
Using RN1, which distinguishes four routing directions, each up routing stage can route a
connection to one of the next three lateral crossovers in the tree or to the next up routing
stage in the tree. The next up routing stage will perform similarly. In this manner, one up
routing stage is needed for every three levels of the tree. Figure 2.4 shows a cross-sectional
view of an upward router and its logical connections to components in the up and down
routing trees. This cross-section spans three logical tree levels which are realized as one
physical up routing stage and three physical down routing stages; the cross-section shows
only a single component at each up and down routing stage.
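
The short-cutting scheme can be sketched as a sequence of direction choices, one per up
routing stage. The encoding below (directions 0 through 2 select the first, second, or
third crossover above a stage, direction 3 continues upward) is an assumption made for
illustration; the text fixes only that three directions lead to crossovers and one to the
next up routing stage.

```python
def up_route(crossover_level):
    """Directions chosen at successive up routing stages.

    Directions 0-2 select the first, second, or third lateral crossover
    above the stage; direction 3 continues to the next up routing stage
    (an assumed encoding).
    """
    if crossover_level < 1:
        raise ValueError("crossover level must be at least 1")
    directions = []
    remaining = crossover_level
    while remaining > 3:
        directions.append(3)   # target is beyond the next three levels
        remaining -= 3
    directions.append(remaining - 1)
    return directions
```

The length of the returned list is the number of up routing stages traversed, which is the
crossover level divided by three, rounded up.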

Utilizing the routing capability of RN1 in this manner, routing up the fat-tree is
accomplished in at most $\frac{1}{3}\log_4 N$ stages as opposed to the $\log_4 N$ that would be
required if no routing were performed in the up tree. At the same time, RN1 still performs
some concentration at each upward routing stage but not full concentration.¹ That is, at each
stage in the up routing tree, all inputs destined for the same parent node cannot end up on
any wire in the up tree connected to that parent node. Each up routing stage allows a given
input wire to be connected to any of two wires in each logical direction when wired properly; a
concentration of two is effectively achieved at each upward routing stage. The best known
means of achieving full concentration at each stage will cost logarithmic time and hardware
[Ajtai 83] [Cormen 86] whereas RN1 performs its limited concentration in constant time
and hardware. In this manner, the concentration in this fat-tree configuration is much like
the fat-tree with constant size switches of Leiserson and Greenberg [Greenberg 85] shown in
Figure 1.5. The number of signals into each level up the fat-tree does grow because fan-in
occurs from more and more stages.

¹ Leiserson did point out that if the routing component were modified slightly to allow a specification of
“route to height x” in the fat-tree in addition to “route in direction y”, it would be possible to specify large
heights in the fat-tree with only a small number of bits and offer more freedom in terms of concentration
and bandwidth allocation [Leiserson 89].

Figure 2.4: Cross-Section View of Up and Down Routing Trees (connections labelled: from
upstream down router; from previous up router or bidelta cluster; to other down routers
as appropriate)

Figure 2.5 shows what the up tree connections look like in three dimensions. The routing
components are at the base. The connections to further up routing stages are shown straight
up from each routing component. Figure 2.6 shows horizontal cross-sections of Figure 2.5
to make the convergence and routing clear. Layer 0 is the layer of routing components.
Layers 1 through 3 show where the logical connections from each routing component fan in
to the lateral connection of the next three crossover stages. The fourth logical connection
out of the routing components, of course, connects directly upward to the next up routing
stage.


Quick Up Routing for Very Large Systems

It is actually possible to do better than the $\frac{1}{3}\log_4 N$ stages of up routing just described.

Figure 2.5: Three-Dimensional View of Connections from One Up Routing Stage

By using a series of routing stages to switch a connection to the maximum height up the
tree to which it needs to be routed, the connection to the appropriate lateral crossover
between the up and down routing tree can be made in $O(\log(\text{number of levels in tree}))$ stages.

Since there are $O(\log N)$ levels in the tree, this means only $O(\log \log N)$ stages are needed
to perform upward routing.² In essence, this routing scheme builds a bidelta network from
the processors to the various tree levels. Figure 2.7 depicts how a bidelta network can be
used to reach any height in a tree of depth 16 in only 2 stages of routing.
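
The stage count quoted for a depth-16 tree can be checked with a small sketch. It assumes each selection stage distinguishes four directions, as RN1 does elsewhere in this report; the function name `quick_up_stages` is hypothetical and introduced only for this illustration.

```python
def quick_up_stages(tree_levels: int, radix: int = 4) -> int:
    """Smallest number of radix-way selection stages able to address every height.

    Each stage multiplies the number of distinguishable heights by `radix`.
    """
    if tree_levels < 1:
        raise ValueError("tree_levels must be at least 1")
    stages, reachable = 0, 1
    while reachable < tree_levels:
        reachable *= radix
        stages += 1
    return stages

# A tree of depth 16 needs two radix-4 stages, since 4 * 4 = 16.
print(quick_up_stages(16))
```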

However, this up routing scheme is only of interest when the fat-tree has a large number
of tree stages. In essence, we are already getting the benefit achievable from this scheme
when the size is on the order of 4 levels. The point where it actually becomes interesting
to use this scheme occurs when the number of levels is roughly 4 × 4 = 16. In our scheme,
however, 16 tree levels implies a network supporting $4^{16}$ bidelta clusters, each with 48 to
192 processors. Thus even in the smallest case we will have more than 200
billion processors. The capacity analysis and geometry requirements for this case are not
considered here since this is clearly not a scheme of current practical interest.
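
The size estimate above is plain arithmetic and can be reproduced directly. The sketch below assumes, as the text does, one bidelta cluster per leaf of a 16-level quaternary tree; the function name is illustrative only.

```python
def total_processors(tree_levels: int, processors_per_cluster: int) -> int:
    """Processors in a quaternary fat-tree with one bidelta cluster per leaf."""
    clusters = 4 ** tree_levels
    return clusters * processors_per_cluster

# Smallest (48-processor) and largest (192-processor) clusters considered in the text.
for per_cluster in (48, 192):
    print(per_cluster, total_processors(16, per_cluster))
```

With 48-processor clusters the total comes to roughly 206 billion processors, consistent with the "more than 200 billion" figure.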

2.2.5 Down Routing

The down routing path is simply the straightforward tree structure. Each down routing
stage switches in four logical directions among its four sub-trees. The logical down routing
path looks just like the logical tree structure shown in Figure 2.3. Most of the inputs to a
down routing stage come from its parent in the fat-tree. The rest of the inputs come from
the lateral crossover from the up routing tree as described in the previous section.

2.2.6 Desired Capacity Growth Rate

One of the nice properties which fat-trees can have is volume-universality. This property
essentially states that a fat-tree can efficiently simulate any other network of
comparable volume with at most a polylogarithmic slowdown [Leiserson 85]. Additionally,
volume-universality implies that larger fat-trees can be created by simply adding levels and
scaling the fat-tree structure in the three-dimensional world.

² N.B. It will always take this many stages regardless of the height routed by such an up routing tree.

Figure 2.6: Horizontal Cross-Sections of One Up Routing Stage

To assure volume-universality, the bisection bandwidth must be properly correlated
with the volumes contained within each portion of the network. For three-dimensional
structures, bandwidth into a volume is generally proportional to the surface area of the
enclosed volume; this volume will in turn be proportional to the number of terminal nodes
or network endpoints from which the volume is composed. Thus we assume:

• Volume $= v \propto$ network endpoints.

• Bandwidth $\propto$ surface area $= a$.

• Surface area $= a \propto (\sqrt[3]{v})^2$.

Figure 2.7: Hybrid Up Routing Scheme with log N Stages in Up Routing Tree

Using $v$ as some measure of the number of network endpoints, and $a$ as some measure of the
network bandwidth, the important relation is $a \propto (\sqrt[3]{v})^2$. If we assume that the enclosed
volume increases by a factor of 4 ($v_n = 4 v_{n-1}$) at each level, as one might expect for a
quaternary tree, it is clear that volume-universality can be achieved only if the bandwidth
increases no faster than the factor derived in Equation 2.1.

$$a_n \propto (\sqrt[3]{v_n})^2 = (\sqrt[3]{4 v_{n-1}})^2 = (\sqrt[3]{4}\,\sqrt[3]{v_{n-1}})^2 = (\sqrt[3]{4})^2 (\sqrt[3]{v_{n-1}})^2 = a_{n-1}\,\sqrt[3]{16} \qquad (2.1)$$

Thus, on average, bandwidth should increase by a factor of $\sqrt[3]{16}$ at each successive level up
the tree to maximize the bandwidth available in the network while keeping the growth rate
appropriately bounded. Whether this bandwidth increase can actually be realized within
the assumed factor of four growth rate is still an open question. A generalization of the
universality proofs in [Leiserson 85] and [Greenberg 85] to this rate of growth may be
possible, but has yet to be demonstrated.
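
The per-level factor can be checked numerically. This sketch assumes only the surface-area relation stated above (bandwidth scaling as volume to the 2/3 power) and a volume growth of 4 per level; the names are illustrative.

```python
import math

VOLUME_GROWTH_PER_LEVEL = 4  # quaternary tree: enclosed volume quadruples per level

def bandwidth_growth_per_level(volume_growth: float = VOLUME_GROWTH_PER_LEVEL) -> float:
    """Surface area scales as volume ** (2/3), which bounds the bandwidth growth."""
    return volume_growth ** (2.0 / 3.0)

factor = bandwidth_growth_per_level()
print(round(factor, 4))       # per-level factor, the cube root of 16
print(round(factor ** 3, 6))  # cumulative growth across three tree levels
```

Three successive levels at this rate multiply bandwidth by 16, which is the figure the following sections aim to match with whole routing components.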

2.2.7 Channel Capacity Growth 

It is easy to see the growth in average channel capacity (the number of distinct physical
connections in or out of an internal tree node) by looking at the channel capacity growth
across three tree levels. For simplicity, consider the portion of a fat-tree network shown in
Figure 2.6. The channel capacity of each logical channel in and out of the bottom of this
structure is eight. The channel capacity of the single logical channel in and out of the top
of this structure is 128. Thus, the channel capacity has grown by a factor of 16 across 3
tree levels, giving the desired average growth of $\sqrt[3]{16}$.

Table 2.1: Up Tree Bandwidth Allocation

Since the up routing and down routing trees are separate, the channel capacity growth
in the up tree differs somewhat from the capacity of the corresponding network stages
in the down tree.

Up Tree

To see how the up tree routing capacity grows, it is easiest to look at the channel
capacities for the first few stages of the network. Consider the portion of the up routing tree
shown in Figures 2.5 and 2.6. Layer 0 simply performs the routing to the next three layers,
so it does not correspond to an actual level in the tree. Each base routing component in layer
0 is an RN1 routing chip. Thus each of the 64 routing components shown in Figure 2.6(A)
has 8 inputs and outputs. The 8 outputs are divided into four logical directions with two
going to each of the next three layers and 2 going further up the tree. Thus layer 1 has 64
pairs of outputs converging to 16 different sub-trees (Figure 2.6(B)); the result is 16 logical
channels of capacity 8. Similarly, layer 2 has 64 pairs of outputs converging in 4 sub-trees,
making 4 logical channels each of capacity 32 (Figure 2.6(C)). Finally, layer 3 has 64 pairs
of outputs converging to a single sub-tree, which gives a single logical channel with capacity
128 (Figure 2.6(D)). This leaves the remaining 64 pairs of outputs, which connect to the
next routing stage as a single logical channel with capacity 128. This logical channel will
then encounter another routing stage just like layer 0 and the progression will continue in
this manner. This capacity progression is summarized in Table 2.1. The final items in the
table are the corresponding layer 0 and 1 progressions for the next routing stage. Thus the
progression of growth factors is 1, 4, 4, giving a capacity growth of 16 in 3 tree stages or
an average growth of $\sqrt[3]{16}$.
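
The layer-by-layer capacities described above follow from dividing the 128 wires sent toward each layer among the sub-trees at that layer. The following sketch reproduces that arithmetic; the constant and function names are assumptions of this example.

```python
COMPONENTS = 64          # RN1 components in layer 0 (Figure 2.6(A))
WIRES_PER_DIRECTION = 2  # outputs each component devotes to one logical direction
RADIX = 4                # sub-trees merged at each level of the quaternary tree

def up_tree_channels(components: int = COMPONENTS, wires: int = WIRES_PER_DIRECTION):
    """Return (layer, logical channels, capacity per channel) for layers 1 through 3."""
    total_wires = components * wires  # wires sent from layer 0 toward each layer
    channels = []
    for layer in (1, 2, 3):
        subtrees = RADIX ** (3 - layer)  # 16, 4, then 1 sub-tree
        channels.append((layer, subtrees, total_wires // subtrees))
    return channels

for layer, count, capacity in up_tree_channels():
    print(f"layer {layer}: {count} channels of capacity {capacity}")
```

Relative to the capacity-8 channels at the bottom of the structure, the capacities 8, 32, and 128 give the growth factors 1, 4, 4 quoted in the text.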

Downward

The bottommost down routing level must provide 8 outputs in each logical destination
to scale properly with the input to the up routing tree. This dictates 8 · 4 = 32 outputs or 4
routing chips. This implies that the total fan-in to this level is also 32. 2 · 4 = 8 of that
fan-in is consumed by lateral connections, leaving 32 - 8 = 24 inputs which must fan in
from above. At the next level, there need to be 24 outputs for each logical destination as
just determined. This dictates 24 · 4 = 96 outputs or 12 routing components and a total
fan-in of 96. There are 2 · 4 · 4 = 32 lateral fan-in signals to this level, leaving 96 - 32 = 64
inputs for fan-in from the previous down routing stage. Similarly, each logical destination
at level three has 64 outputs associated with it. This dictates 64 · 4 = 256 total outputs
or 32 routing chips. This gives a total fan-in to this level of 256. The lateral fan-in is
2 · 4 · 4 · 4 = 128, leaving 256 - 128 = 128 inputs for fan-in from above.

Table 2.2: Down Tree Bandwidth Allocation

Beyond this point, the structure replicates. The next level up looks like the bottom
level scaled so that the input size in each direction matches the output from the previous
stage. The progression may continue in this manner ad infinitum. The progression of channel
capacities in the downward route is thus as shown in Table 2.2. Level n+4 is equivalent
to the level n stage since the series begins to repeat at that point. Thus the progression
of growth factors is 3, 8/3, 2, giving a capacity growth of 16 across 3 tree levels. This again
gives the desired average growth of $\sqrt[3]{16}$. This is identical to the growth in the upward
routing tree, but the growth does not coincide with the upward tree on a stage per stage
basis.
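
The downward progression can be tabulated mechanically from the rules given in the text: four logical directions per stage, eight ports per RN1 chip, and a lateral fan-in of 2 · 4^k at level k. The sketch below is one way to express that bookkeeping; the dictionary keys and function names are assumptions of this example, not notation from the report.

```python
from fractions import Fraction

RADIX = 4           # logical directions out of each routing stage
PORTS_PER_CHIP = 8  # inputs (and outputs) on one RN1 component
LATERAL_PAIR = 2    # wires each up-routing component sends toward a given layer

def down_tree_levels(base_outputs_per_direction: int = 8, levels: int = 3):
    """Tabulate outputs, chip count, and fan-in for successive down routing levels."""
    rows = []
    per_direction = base_outputs_per_direction
    for level in range(1, levels + 1):
        total = per_direction * RADIX            # outputs equal total fan-in
        lateral = LATERAL_PAIR * RADIX ** level  # fan-in arriving from the up tree
        from_above = total - lateral             # fan-in left for the parent stage
        rows.append({
            "level": level,
            "outputs_per_direction": per_direction,
            "total_outputs": total,
            "chips": total // PORTS_PER_CHIP,
            "lateral_fan_in": lateral,
            "fan_in_from_above": from_above,
        })
        per_direction = from_above  # the next level must supply this many per direction
    return rows

def growth_factors(rows):
    """Per-level growth in channel capacity, kept exact with Fraction."""
    return [Fraction(r["fan_in_from_above"], r["outputs_per_direction"]) for r in rows]

for row in down_tree_levels():
    print(row)
print(growth_factors(down_tree_levels()))
```

The exact fractions make it easy to confirm that the three uneven steps multiply out to 16.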

2.3 Wiring Constraints for Efficient Bandwidth Distribution

In both the up and down routing portions of the fat-tree network, a logical routing stage
is composed of many separate routing components. The inputs to any logical routing stage
come from different logical directions; that is, the inputs to the logical routing stage may
be lateral crossover connections from different subtrees or may come from the immediate
parent routing stage. The distribution of these inputs to physical routing components
presents a more general case of the input bandwidth allocation discussed in Section 2.1.1.
In this section, the issue of bandwidth distribution to routing components is examined in
more detail in order to develop some constraints for the general problem of optimally wiring
the fat-tree network.

Again, we start by examining the two extreme cases for bandwidth distribution. I use
the word dispersion to refer to the degree to which inputs from different logical directions
are distributed across distinct routing components. Dispersion is closely related to the
notion of expansion in the theory literature.

Figure 2.8: Non-Disperse Example

2.3.1 No Dispersion

In the non-disperse case at each such logical routing stage, the inputs from a given
direction are each routed to their own set of routing components. That is, if the inputs
from a given direction make up a fraction α of the inputs into that level, then these are all
destined for an equivalent fraction of the routing components at that level. Components
will be shared between logical input directions only in the case that there is an uneven
division of components among the input directions.

At any stage, inputs from a given direction can only make use of α of the bandwidth
to each logical destination. This means the inputs from that stage can, at most, reach
α of the routing components at the next level. This effectively minimizes the number of
different components, and hence paths, reachable; this is non-optimal for fault tolerance
and for minimizing routing congestion. On the positive side, at least α of the bandwidth in
a given direction is guaranteed to the inputs from each input direction; since the inputs
from a single logical direction do not, in general, share routing components with inputs
from other directions, those inputs are guaranteed the entire α of the bandwidth. Non-dispersion
means that at each level, inputs from the same direction are competing only
with each other for the bandwidth to the next level.


Example Consider Figure 2.8 for a concrete example of this wiring strategy. This switching
could represent Level 1 of the fat-tree in the downward routing path. The logical routing
stage is thus composed of four routing components as shown. In this case, 8 wires enter from
the lateral direction while 24 enter from the previous down routing stage in the downward
routing tree. Eight wires leave level 1 in each of the 4 logical directions.

Here, the 32 wires are partitioned based on their direction of origination. The 8 lateral
wires all go into a single routing chip. The 24 wires from farther up the tree are connected
to the remaining 3 routing chips. Of the 8 lateral connections through this level, at most
two connections can be made in each logical output direction; however, if more than two
lateral connections need to be made in a particular direction, at least 2 of the connections
will be made. A single chip failure at this level would either prevent lateral communication
entirely or cut down the bandwidth from farther up in the tree by 33%.

Figure 2.9: Disperse Example

2.3.2 Full Dispersion

Full dispersion is achieved when the input wires to each level from a given input direction
are spread out amongst as many different routing components as possible. The pair of
outputs from the same direction of a single chip at a previous level always go to different
routing components.

The number of potential paths from point to point expands at each level. This results
because each input wire from a logical input direction has the opportunity to leave on either
of two outputs; at the same time, when a connection from a logical input direction does exit
the routing level via a particular output port, it is not diminishing the bandwidth potential
for any other inputs from that same logical direction. Thus inputs from all directions
compete for the bandwidth to the next level. To a limited extent, this allows averaging of
the load from several input directions across the total available bandwidth. Similarly, the
effects of component failure are spread evenly over all logical input directions. When a chip
fails, it effectively reduces a fraction of the fan-in from each logical direction.

Example Figure 2.9 gives the concrete example for the full dispersion case. This corresponds
to the disperse wiring of the same level 1 downward routing path given in the
example of the previous section.

In this case, the 32 wires are distributed evenly, based on direction of origination, across
the 4 routing chips. If all 8 lateral connections were destined for the same output direction,
there is a possibility that they could all be routed; there is also the possibility that they
could all be blocked because all the bandwidth in that particular direction is consumed by
connections from farther up the tree. A single chip failure at this level would effectively cut
the bandwidth out of this level down by 25%.
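
The failure figures quoted for the two wiring extremes can be reproduced with a small model. The sketch below represents each routing chip as a mapping from input direction to wire count, which is an assumption of this illustration rather than anything specified for RN1, and reports the fraction of each direction's fan-in lost when a given chip fails.

```python
from fractions import Fraction

def failure_impact(wiring):
    """For each chip, the fraction of every direction's inputs lost if that chip fails.

    `wiring` holds one dict per routing chip, mapping an input direction name
    to the number of input wires that chip receives from that direction.
    """
    totals = {}
    for chip in wiring:
        for direction, wires in chip.items():
            totals[direction] = totals.get(direction, 0) + wires
    return [
        {direction: Fraction(chip.get(direction, 0), total)
         for direction, total in totals.items()}
        for chip in wiring
    ]

# Non-disperse wiring (Figure 2.8): one chip takes all 8 lateral wires.
NON_DISPERSE = [{"lateral": 8}, {"parent": 8}, {"parent": 8}, {"parent": 8}]
# Fully disperse wiring (Figure 2.9): every chip takes 2 lateral and 6 parent wires.
DISPERSE = [{"lateral": 2, "parent": 6} for _ in range(4)]

for name, wiring in (("non-disperse", NON_DISPERSE), ("disperse", DISPERSE)):
    print(name, failure_impact(wiring))
```

In the non-disperse wiring a single failure removes either all lateral fan-in or one third of the parent fan-in; in the disperse wiring every failure removes one quarter of each.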
2.3.3 Potential Path Growth

From the above, it is easy to see that full dispersion will maximize the number of paths available
for a connection through the network. This factor of two increase at each level in the
number of paths over which a given connection can be made is the best achievable using
RN1. This occurs when each wire on which a connection could potentially be made by
a common source is connected to a different routing component. There are then twice as
many potential paths toward the desired destination out of each routing level as there were
into the level. As long as this property can be maintained, each routing level will be able
to achieve this effective doubling of the number of paths between a given source and its
destination.

Certainly, it is possible to wire the network with fewer potential paths between each
source-destination pair. If two or more wires through which a given source could have routed a
message are connected to the same routing chip, then the path growth will be
smaller.

In order to guarantee that we can achieve the maximal path growth or path expansion
described above, at each level, it need only be the case that there is not a single set of
wires that can be reached from a single processor that accounts for more than […] of the
total bandwidth into any level. This is not the same as saying that the inputs from a single
logical direction must compose less than […] of the inputs to a given logical routing level.

The inputs from a given logical direction are not necessarily fully concentrated when they
enter a routing level. As long as the constraint given above on the fraction of potential
inputs from a single source can be maintained, each input wire from a potentially common
source can be routed through a different routing component at any level in question. For
the fat-tree routing structure described here, this property is maintained at each logical
routing level composing the network.

Leighton and Maggs [Leighton 89-2] point out that the above constraint is sufficient
only to guarantee optimal path expansion at the micro-level; that is, when looking at the
number of paths between a single source and destination. When viewing path expansion at
the macro-level, additional constraints are necessary to guarantee optimal path expansion;
macro-level path expansion is the growth in the number of paths between sets of source
and destination processors rather than simply individual processors. Leighton and Maggs
demonstrate the advantages of optimal or near-optimal macro-level expansion wiring when
used with a scheme where the switching elements can arbitrate with one another concerning
network loading [Leighton 89-1]. RN1, however, selects output paths obliviously. As such,
it is not clear that their macro-level constraints will have any significant effect in practice
on the routing performance of a network built using RN1 switches. Recent work [DeHon 90]
would indicate that this kind of expansion is useful for maximizing the fault tolerance of
the network.

The complexity of Leighton and Maggs' switching element [Leighton 89-1] is much
greater than that of RN1. As such, it would certainly represent a much slower component.
Additionally, their switching scheme requires approximately $4 \log N$ clock cycles to
route a connection as opposed to the $\log N$ clock cycles required by RN1. Both of these
properties make its routing much slower than RN1. However, if its routing efficiency turns
out to be greater than that of the oblivious routing done by RN1 by a comparable factor,
then perhaps it would be beneficial to look further into the alternative of constructing such
a switching element.

3. Construction


This chapter expands on the technical details involved in the construction of the network
structure described in the previous chapter. Section 3.1 discusses a few further issues about
the construction of the bidelta clusters at the leaves of the fat-tree. Section 3.2 addresses
a couple of issues important to the realization of such a network using RN1. Section 3.3
introduces the unit tree, a stack structure that can serve as the basic building block for this
kind of fat-tree network. Then Section 3.4 describes a possible geometry for arranging these
unit tree structures in three dimensions to realize networks of reasonable size. Section 3.5
follows, describing construction issues for this geometry. Section 3.6 then deals with the
long wires that necessarily occur in this structure. Section 3.7 briefly details processor
placement with respect to such a network. Finally, Section 3.8 details an efficient scheme
for computing routing sequences for this network.

3.1 Bidelta Leaf Clusters

Section 2.1 described the composition of the bidelta leaf clusters. This section fills in
construction details not readily apparent from the previous description.

3.1.1 Connections To Fat-Tree

For simplicity, the 11 logical routing direction out of the first stage of routing in the
bidelta cluster is chosen to route into the fat-tree network. That is, if the first two routing
bits of the routing sequence are both ones, the connection will be routed out of the bidelta
cluster and into the fat-tree network. The choice of which logical direction to use for this
purpose is somewhat arbitrary, but it is useful to settle on a single direction for reference
purposes.

3.1.2 Connections From Fat-Tree

The first stage routing components will be configured to swallow the first routing byte
they encounter, as described in Section 1.3.2, for cluster inputs from the fat-tree network.
This means the swallow property must be set only on the two inputs from the fat-tree
network to each routing component in the first routing stage; the other six inputs to each
routing component will come directly from processors and not require this swallow property
to be set. In this manner, a fresh routing byte is always used to route through the bidelta
cluster regardless of the source of the connection. This provides a logical separation between
the portion of the routing sequence used to route through the fat-tree and that used to route
within the cluster, making routing calculation moderately easy. It also guarantees that a
fresh set of routing bits is available to route through the bidelta cluster.




3.1.3 Size

The number of processors that a single bidelta cluster stack can support is constrained
by physical size, bandwidth, and granularity. Since the routing component distinguishes
four distinct directions, each additional routing stage will increase the size (the number of
processors) by a factor of four. Note, however, that since one logical direction out of the
first stage connects to the fat-tree rather than the bidelta network, the first stage only
distinguishes three directions within the network. This granularity constraint alone specifies
the construction of clusters which support $3 \cdot 4^i$ processors, for any non-negative $i$. When
considering physical size we are limited by the size we can construct the horizontal routing
boards (see Section 1.4). Considering the projected size of RN1, the largest possible single
horizontal routing plane we can construct will support 8 × 8 = 64 routing components; this
would allow the construction of the 4 stage network supporting $3 \cdot 4^3 = 192$ processors.

Additionally, the bidelta cluster will have to provide bandwidth to and from the fat-tree
network which will match that of the fat-tree network being constructed.

For notational convenience, a bidelta cluster supporting $n$ processors will be denoted
by $B_n$; e.g., a 3 stage bidelta cluster will support $3 \cdot 4^2 = 48$ processors and will thus be
referenced as $B_{48}$.
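
The granularity rule can be written as a one-line function. This is a minimal sketch under the text's assumptions (three usable directions in the first stage, four in every later stage); the function name is illustrative.

```python
def cluster_processors(stages: int) -> int:
    """Processors supported by a bidelta leaf cluster with the given number of stages.

    The first stage reserves one of its four logical directions for the
    fat-tree, so it contributes a factor of 3; each later stage contributes 4.
    """
    if stages < 1:
        raise ValueError("a cluster needs at least one routing stage")
    return 3 * 4 ** (stages - 1)

for stages in (3, 4):
    n = cluster_processors(stages)
    print(f"{stages} stages -> B_{n}")
```

The 3 and 4 stage cases give the 48 and 192 processor clusters referred to throughout the report.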

3.2 Inter-Stage Details

3.2.1 Up Path Directions

Routing components in an up routing stage route in four distinct directions as shown
in Figure 2.4. Here, again, the assignment of actual routing directions to each of the four
logical directions is somewhat arbitrary. It is useful, though, to make this assignment in a
logical manner for the sake of reference and routing simplicity. The 11 direction is assigned
to the direction which routes further up the tree. This is consistent with the choice of the 11
direction to route out of the bidelta cluster. The 00, 01, and 10 directions respectively
route to each of the next three successive down routing stages. That is, the 00 direction
routes to the next down routing stage, 01 to its parent, and 10 to the next parent.
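
The direction assignment lends itself to a small lookup. In the sketch below, the argument counts how many down routing stages above the nearest one the crossover should occur; that parameterization and the function name are assumptions made for illustration, while the bit patterns are those given in the text.

```python
# Bit patterns from the text: 00, 01, 10 select the next three successive
# down routing stages; 11 continues to the next up routing stage.
CROSSOVER_DIRECTIONS = {0: "00", 1: "01", 2: "10"}
CONTINUE_UP = "11"

def up_stage_direction(stages_above_nearest: int) -> str:
    """Routing bits for one up routing stage.

    `stages_above_nearest` is 0 to cross over to the next down routing stage,
    1 for its parent, 2 for the next parent; anything larger keeps going up.
    """
    if stages_above_nearest < 0:
        raise ValueError("stages_above_nearest must be non-negative")
    return CROSSOVER_DIRECTIONS.get(stages_above_nearest, CONTINUE_UP)

print([up_stage_direction(k) for k in range(5)])
```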

3.2.2 Swallows

In order to allow routing to arbitrarily many destinations, we need to make sure that
the fat-tree network is configured to discard an old routing byte when it is exhausted. Thus
the swallow stages need to be arranged in a functional manner. This task is complicated
by the fact that connections can be made through arbitrary heights in the tree such that
complete paths from source to destination are not entirely homogeneous; that is, paths will
differ in length based on where they entered the down routing tree.

Ideally we would like each of the swallow stages to be placed at a maximum distance from
each other. This is desirable because each swallow stage effectively costs one clock cycle of
delay while the old routing byte is being discarded so the new one can take its place. Thus to
minimize the time to make a connection, swallow stages should be placed as infrequently
as possible.

Since there are 8 bits in each routing byte and each routing stage consumes 2 bits,
swallows are required at least every four routing stages. Swallow stages will certainly be
needed at least every 4 stages on the up path and on the down path. It is possible to change
from up routing to down routing at any up stage. In order to be assured that the down
routing stages get fresh routing bytes when needed, the old routing byte will be stripped
off as part of the lateral crossover from the upward tree to the downward. This routing
byte change will be realized by setting the swallow property on all lateral inputs to a down
routing stage. This lateral swallow provides a clear separation between the up and down
routing portions of a connection.
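
A small sketch can make the routing-byte bookkeeping concrete. It assumes swallows are placed exactly every fourth stage on each of the up and down paths and that the lateral crossover always starts a fresh byte, as described above; it covers only the fat-tree portion of a path (not the bidelta clusters), and the function name is hypothetical.

```python
import math

BITS_PER_ROUTING_BYTE = 8
BITS_PER_STAGE = 2
STAGES_PER_BYTE = BITS_PER_ROUTING_BYTE // BITS_PER_STAGE  # stages served by one byte

def fat_tree_routing_bytes(up_stages: int, down_stages: int) -> int:
    """Routing bytes consumed by the fat-tree portion of a connection.

    The up and down portions are counted separately because the lateral
    crossover swallows whatever remains of the current up-path byte.
    """
    if up_stages < 0 or down_stages < 0:
        raise ValueError("stage counts must be non-negative")
    up_bytes = math.ceil(up_stages / STAGES_PER_BYTE)
    down_bytes = math.ceil(down_stages / STAGES_PER_BYTE)
    return up_bytes + down_bytes

print(STAGES_PER_BYTE)
print(fat_tree_routing_bytes(up_stages=1, down_stages=3))
print(fat_tree_routing_bytes(up_stages=5, down_stages=4))
```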

3.2.3 Bit Rotations

Recall from Section 1.3 that the appropriate bits from the data path are used to perform
routing. In order to assure that a different pair of bits is used at each level, the data path
is rotated between routing levels. This poses a problem similar to the swallow. Whereas,
in a bidelta network, all paths in the network are of the same length, all paths through the
fat-tree are not. More importantly, all paths are not even of the same length modulo four.
Even paths through a given routing component can differ in length. Thus, some attention
must be given to assuring that the bits are rotated uniformly through the tree network.

From the source processor to a given height in the tree, the amount of rotation incurred
will be known. Similarly, from that height down to a destination, the number of rotations
will be known. What we must assure is that the total number of bit rotations modulo
four through the network is the same regardless of the height at which the lateral crossover
occurs. In order to guarantee this, we must use the lateral crossover between the up routing
tree and the down routing tree to shuffle the bits into a consistent state; that is, make sure
that the lateral bits into a given down routing level have the same rotation applied to them
as those coming into the down routing level from above. This is closely related to the need to
swallow on the lateral crossover in order to be synchronized with the point of entry into
the downward routing path, and similarly guarantees that, regardless of the history of a
connection's path, it merges consistently into the downward tree.

3.3 Unit Tree

In building the fat-tree network, we are constrained in the size we can reliably fabricate
the component stacks. Recall from Section 1.4 that the horizontal pc-boards are limited
to about two feet square. There are also other difficulties that arise if we attempt to build
larger stacks. The on-board wire lengths begin to become a more serious concern, slowing
the clock cycle of each routing stage. The already dense wiring on the pc-boards becomes
proportionally more severe.

It is clear that the fat-tree network will have to be packaged in multiple component
stacks. From this realization, it is necessary to determine how to separate the fat-tree into
stacks and how each constituent stack is composed. We wish to avoid building many stacks
of different geometries, and hence needing to design and fabricate many different pc-boards
and other components. We want as much as possible to be reusable when scaling to larger
and larger sized fat-trees. Essentially, we need a standard replicable structure that can
serve as a building block for fat-trees in much the same way that a set of identical routing
components can be used to construct a routing stack.

3.3.1 Unit Tree Structure

The unit tree is a single stack structure from which a fat-tree interconnection of a wide
variety of sizes can be built. This basic building block, when replicated and properly
arranged, will realize the fat-tree structure described herein.


Size

As noted, current technology limits the size of a single horizontal pc-board routing layer
to about 2' x 2'. Given the target size of RN1 (dimensions illegible in source), if we leave an
equal amount of space between each routing component for routing wires, we can place about
8 x 8 = 64 routing components in a single routing layer. As such, a unit tree is constrained
to this size for the present time. As technology improves, the number of components per layer
will, no doubt, increase; however, this scheme is very scalable and can easily be modified to
take advantage of improved technology. For the remainder of this paper, 64 routing components
will be assumed as the maximum layer size.

The basic structure of the fat-tree described in Section 2.2 showed that the fat-tree is
built by replicating a series of one upward routing stage and 3 downward routing stages.
These 4 stages describe the natural structure of the tree. This gives a natural division
between routing levels appropriate for inclusion in each unit tree stack.

The unit tree is thus constructed in 4 layers as shown in Figures 2.5, 2.6, and 2.3. The up
routing layer is at the base and is composed of the maximum 64 routing components. It is
followed by the first down routing stage, which is also composed of 64 routing components in
order to match bandwidth with the up routing stage. The next down routing stage provides
input to only three-fourths of the first down routing stage, since lateral inputs consume
one-quarter of the input capacity to the first down routing stage. It is thus composed of
only 48 routing components. The third and final down routing stage provides inputs to
the second down routing stage. Since the second down routing stage has one-third of its
inputs dedicated to lateral crossover inputs, the third stage is only two-thirds the size of
the second. This makes it one-half the size of the first down routing stage, which is 32
components. The component count for a unit tree is shown in Table 3.1. Each
such unit tree thus requires a total of 208 routing components. With four routing layers,
the stack should be about 1.5' tall, making the entire stack roughly 2' x 2' x 1.5'. Since
there are several variations of this basic unit tree worth considering, when it is necessary
to differentiate them, this particular unit tree will be referenced as UT_64x8.
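
The layer sizes follow mechanically from the lateral input shares. The following is a minimal sketch, assuming the 64-component maximum layer and the one-quarter and one-third lateral shares stated above; the function name `unit_tree_layers` is illustrative and does not come from the report.

```python
from fractions import Fraction

MAX_LAYER = 64  # maximum routing components per layer assumed in the text


def unit_tree_layers(max_layer=MAX_LAYER):
    """Return component counts for [up, down 1, down 2, down 3]."""
    up = max_layer
    down1 = max_layer                      # matches the up stage bandwidth
    down2 = int(down1 * Fraction(3, 4))    # one-quarter of down 1 inputs are lateral
    down3 = int(down2 * Fraction(2, 3))    # one-third of down 2 inputs are lateral
    return [up, down1, down2, down3]


layers = unit_tree_layers()
print(layers, "total:", sum(layers))
```

Calling `unit_tree_layers(16)` applies the same shares to a layer one-quarter the size.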


Characteristics

With this development, the size and geometry of a basic unit tree are fully constrained.
As such we can summarize the bandwidth characteristics as given in Table 3.2.

Level    Routing Components
  0             64
  1             64
  2             48
  3             32

Table 3.1: Unit Tree Component Summary



                      Total        Total Logical
                      Bandwidth    Channels         Capacity
Up from Leaves          512            64               8
Down toward Leaves      512            64               8
Up toward Root          128             1             128
Down from Root          128             1             128

Table 3.2: Bandwidth


Stack    Component    Available    Required Bandwidth
Layer    Count        Bandwidth    Down Path    Up Path
  0         64           512          512           0
  1         64           512            0         384
  2         48           384            0         256
  3         32           256            0         128

Table 3.3: Vertical Through Bandwidth
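
The entries of Table 3.3 can be cross-checked from the layer sizes of Table 3.1. This is a sketch under stated assumptions: each package supplies 8 through conduits, each component handles 8 channels, and every down routing stage absorbs 128 lateral channels (one-quarter of 512, one-third of 384, one-half of 256). The names below are illustrative only.

```python
CHANNELS_PER_COMPONENT = 8   # channels handled by one routing component
CONDUITS_PER_PACKAGE = 8     # through-routing bundles provided by one package
LATERAL_PER_STAGE = 128      # assumed lateral channels taken by each down stage


def through_bandwidth(layers=(64, 64, 48, 32)):
    """Return rows of (layer, components, available, down needed, up needed)."""
    rows = []
    up_in_flight = layers[0] * CHANNELS_PER_COMPONENT  # channels leaving the up stage
    for level, count in enumerate(layers):
        available = count * CONDUITS_PER_PACKAGE
        if level == 0:
            # The whole downward bandwidth passes through the up routing layer.
            down_needed = layers[1] * CHANNELS_PER_COMPONENT
            up_needed = 0
        else:
            # Part of the upward traffic crosses over laterally at this stage.
            up_in_flight -= LATERAL_PER_STAGE
            down_needed = 0
            up_needed = up_in_flight
        rows.append((level, count, available, down_needed, up_needed))
    return rows


for row in through_bandwidth():
    print(row)
```

In every layer the required through bandwidth stays at or below what the packages provide.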


Through Bandwidth

All of the vertical straight-through connections between horizontal boards in the stack
structure are contained on the routing components' package as described in Section 1.3.1.
This limits the number of vertical routing conduits available for up or down routing
connections which simply pass through a layer without being involved in the routing occurring
on that layer. Each package provides sufficient through routing conduits for 8 bundles of
9 wires apiece. No vertical vias are needed for the downward routing path, except when
the entire downward bandwidth must route through the upward routing level. The upward
routing path requires through bandwidth on all the down routing levels. Available and
required bandwidth is summarized in Table 3.3. Clearly, there is adequate vertical
through bandwidth available for this scheme.

3.3.2 Tree Root


Two alternatives exist for terminating the fat-tree. It is possible simply to build the
tree up to a desired size and leave the bandwidth out of the top of the topmost stage of unit
trees unused. Alternately, we can build a capping level which utilizes the logical direction
which would have routed further up the tree to route laterally to another down routing
level.

The first option is the conceptually cleanest. All of the unit tree stacks will be identical.
Routing will be identical on trees of all sizes.

The latter option allows more size flexibility in some cases. Knowing the size of the stack
becomes important to routing considerations when the root stack is capped in this manner.
Stacks with a cap level will be slightly different from other stacks, so that not all stacks will
be identical; however, these capped stacks can differ only by the addition of an additional
routing layer to the top of a standard unit tree stack.


Cap Level

Conceptually, the root layer constructed for a capped unit tree will be for the
convergence of four unit trees; however, the components that compose this layer can be distributed
across the four stacks. This distribution is necessary since we cannot make a single stack
larger. At the top of a standard unit tree there are 128 channels that would go further
toward the root and 128 coming from the root. With a down bandwidth of 128 into each
of the 4 stacks, and hence logical directions, we need a total of 64 routing components to
compose the final level. One-fourth of these components can be placed in each stack. This
appropriately gives 128 outputs from this new level to the 128 inputs to the down routing
level which was formerly the top of each stack. Each stack will have 32 connections
through the capped root back to itself and 32 connections from each of the three
adjacent stacks into its original top down routing level. Similarly, each stack will connect
to 3 other stacks with 32 connections. Obviously, if the root level of unit trees is composed
of more than 4 stacks, the 32 connections in each logical direction should be maximally
distributed over all stacks composing each logical direction.

For the sake of clarity, this capped unit tree will be referenced as UT_64x8c.
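
The cap level arithmetic can be sketched as follows. The assumption that one routing component drives 8 output channels is inferred from the component counts and bandwidths in Tables 3.1 and 3.2; `cap_level` is a hypothetical helper name.

```python
CHANNELS_PER_COMPONENT = 8  # assumption: output channels driven by one component


def cap_level(stacks=4, top_channels=128):
    """Return (cap components, components per stack, connections per stack pair)."""
    outputs_needed = stacks * top_channels              # inputs of all top down levels
    components = outputs_needed // CHANNELS_PER_COMPONENT
    per_stack = components // stacks                    # cap components housed per stack
    per_direction = top_channels // stacks              # channels toward each of 4 directions
    return components, per_stack, per_direction


print(cap_level())
```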


3.3.3 Building a Fat-Tree from Unit Trees

Each bidelta leaf cluster has a bandwidth of 2 N_leaf / 3 into the fat-tree, where N_leaf is the
number of processors in the leaf cluster. A single unit tree has a bandwidth of 8 in each of
the 64 logical directions it distinguishes. Thus N_leaf / 12 unit trees are required to construct
the smallest fat-tree, which has only a single layer of unit trees. The next larger size can
then be constructed by replicating this smallest structure 64 times and using another layer
of unit trees to interconnect these 64 subtrees. Enough unit trees at this second level will
be needed to match the bandwidth out of the top of the first stage of unit trees. This
progression can be continued to build fat-trees arbitrarily large.

Examples


12K  As an initial example, consider building a 12K processor machine. We can use a
N_leaf = 192 leaf cluster for the leaf. The next level is then 192/12 = 16 UT_64x8 unit trees. This
gives 192 · 64 = 12288 processors using 64 bidelta clusters and 16 UT_64x8 unit tree
blocks.


48K  Using a capping level at the top of a single layer of unit trees, we can support 48K
processors, four times as many processors as the previous example. The number of unit
trees simply needs to grow by a factor of four. Similarly, four times as many bidelta clusters
are needed. Thus, this is constructed from 256 bidelta clusters and 64 UT_64x8c unit trees.


768K  Adding a full second level of unit trees allows us to connect 192 · 64^2 = 768K
processors. This requires a bidelta leaf cluster for each 192 processors:

    \frac{768K}{192} = 4K

The lowest level of unit trees is composed of unit trees given by the following:

    \underbrace{\frac{768K}{192}}_{(a)} \cdot \overbrace{\underbrace{\frac{192}{12}}_{(b)} \cdot \underbrace{\frac{1}{64}}_{(c)}}^{(d)} = 1K

(a) is the number of bidelta clusters that connect to this level. (b) is the number of unit
tree stacks necessary to satisfy the bandwidth for a 192-processor bidelta block. (c) is the
fraction of the unit trees from (b) that are actually used by a single bidelta block. (d), which
is the product of (c) and (b), is the ratio of unit tree stacks to bidelta clusters
required to build this first level. The bandwidth out of the top of a unit tree is one-fourth
the bandwidth into it. Thus the next level will only require one-fourth as many unit trees:

    \frac{1024}{4} = 256

This brings the total structure to 4096 bidelta clusters and 1280 UT_64x8 unit tree
stacks.


Generalizing

We can generalize network sizes and composition in terms of the number of unit tree
layers used for network construction. Again, N_leaf is the number of processors composing
a single bidelta leaf cluster. i is a parameter denoting the total number of stack levels,
including the leaf cluster stack; i is constrained only to be an integer greater
than one. The total number of processors supported by a network with i - 1 unit tree layers,
all of type UT_64x8, is thus N_total as shown in Equation 3.1.

    N_{total} = 64^{(i-1)} \cdot N_{leaf}                                        (3.1)

This requires one bidelta cluster for each group of N_leaf processors.

    N_{bidelta} = \frac{N_{total}}{N_{leaf}}                                     (3.2)

The first layer of unit trees is composed of N_leaf / (12 · 64) UT_64x8 unit trees for each bidelta
cluster. Each successive layer requires one-fourth as many unit trees since the bandwidth
out of the top of each unit tree is one-fourth the bandwidth into the bottom. Thus the
total number of unit trees necessary to construct a network of N_total processors is simply
N_unit as given by Equation 3.3.

    N_{unit} = \sum_{j=1}^{i-1} \left( \frac{N_{total}}{N_{leaf}} \cdot \frac{N_{leaf}}{12} \cdot \frac{1}{64} \cdot \frac{1}{4^{(j-1)}} \right)
             = \left( \frac{N_{total}}{12 \cdot 64} \right) \sum_{j=1}^{i-1} \frac{1}{4^{(j-1)}}
             = \left( \frac{N_{total}}{576} \right) \left( 1 - 4^{(1-i)} \right)  (3.3)

Thus a network of N_total processors is constructed with N_bidelta leaf clusters and N_unit
UT_64x8 unit trees.
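
As a cross-check, Equations 3.1 through 3.3 can be evaluated directly and compared with the 12K and 768K examples. This is a minimal sketch; the function name `uncapped_network` and the use of integer division (valid when N_leaf is a multiple of 12) are assumptions made for illustration.

```python
def uncapped_network(i, n_leaf):
    """Return (N_total, N_bidelta, N_unit) for i stack levels of uncapped UT_64x8."""
    if i < 2:
        raise ValueError("i must be an integer greater than one")
    n_total = 64 ** (i - 1) * n_leaf                 # Equation 3.1
    n_bidelta = n_total // n_leaf                    # Equation 3.2
    first_layer = n_total // (12 * 64)               # unit trees in the lowest layer
    # Equation 3.3: each successive layer needs one-fourth as many unit trees.
    n_unit = sum(first_layer // 4 ** (j - 1) for j in range(1, i))
    return n_total, n_bidelta, n_unit


print(uncapped_network(2, 192))   # 12K example
print(uncapped_network(3, 192))   # 768K example
```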

Using capped unit trees at the top level, the composition is slightly different. The total
number of processors is greater by a factor of four because of the extra routing stage.

    N_{total} = 4 \cdot 64^{(i-1)} \cdot N_{leaf}                                (3.4)

The number of bidelta leaf clusters needed is computed as before.

    N_{bidelta} = \frac{N_{total}}{N_{leaf}}                                     (3.5)

The number of stacks progresses the same. The difference here is that the top level is
constructed out of UT_64x8c unit trees. Thus the number of UT_64x8 unit trees, N_unit, and
UT_64x8c unit trees, N_cunit, are computed by Equations 3.6 and 3.7.

    N_{unit} = \sum_{j=1}^{i-2} \left( \frac{N_{total}}{N_{leaf}} \cdot \frac{N_{leaf}}{12} \cdot \frac{1}{64} \cdot \frac{1}{4^{(j-1)}} \right)
             = \left( \frac{N_{total}}{576} \right) \left( 1 - 4^{(2-i)} \right)  (3.6)

    N_{cunit} = \frac{N_{total}}{768 \cdot 4^{(i-2)}}                            (3.7)
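
The capped case can be checked the same way against the 48K example. This sketch assumes Equations 3.4 through 3.7 as read from the text; `capped_network` is a hypothetical name.

```python
def capped_network(i, n_leaf):
    """Return (N_total, N_bidelta, N_unit, N_cunit) when the top layer is capped."""
    if i < 2:
        raise ValueError("i must be an integer greater than one")
    n_total = 4 * 64 ** (i - 1) * n_leaf             # Equation 3.4
    n_bidelta = n_total // n_leaf                    # Equation 3.5
    first_layer = n_total // 768                     # 768 = 12 * 64
    # Equation 3.6: all layers below the top use ordinary UT_64x8 stacks.
    n_unit = sum(first_layer // 4 ** (j - 1) for j in range(1, i - 1))
    # Equation 3.7: the top layer is built from capped UT_64x8c stacks.
    n_cunit = first_layer // 4 ** (i - 2)
    return n_total, n_bidelta, n_unit, n_cunit


print(capped_network(2, 192))   # 48K example
```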


3.3.4 Alternative Unit Tree

It is also worthwhile to consider an alternative unit tree structure. In particular, this is
necessary if we wish to construct entirely fat-tree networks; that is, if we wish to construct
networks that have processors as leaves rather than bidelta clusters. Alone, the unit tree just
described is inadequate for this purpose because it will always provide at least 8 connections
in each logical direction. To match the connections to a single processor, we need a unit
tree that provides two logical connections per direction. This unit tree can be used by itself
to construct such a complete fat-tree network, or used only as the bottommost tree stage
of a fat-tree network utilizing UT_64x8 unit trees for the construction of the upper portion
of the tree. For distinction, this unit tree will be referred to as UT_64x2.


Fault Tolerance

To get the desired two outputs, and hence inputs, in each logical direction, we essentially
need to scale down the size of the unit tree constructed by a factor of four from that of
the UT_64x8 unit tree. In naively scaling down the unit tree size, we encounter one problem
at the final down routing stage. Each single routing component becomes critical in order
to provide network connectivity to four processors; that is, unlike the components in other
stages, if one of these components fails, four processors will become unreachable.

This identical sort of problem occurs in the bidelta clusters. As described in Sections 1.
and 1.5, this was the reason for implementing the alternative configuration for RN1 as two
separate 4 x 4 crossbars. With RN1 configured in this manner at the bottommost down
routing stage, instead of having a single component that makes the final routing connection
to 4 processors, we have 2 components that make the final connection to 8 processors. This
way no single component is critical to the functionality of the network.

To avoid having a similar problem on the inputs, we also configure these as in the bidelta
configuration. While each processor will use only a single connection into the network at
a time to guarantee that the network is not overloaded, each processor has two network
connections. Each of these network connections should be connected to different routing
components. The processor then guarantees to only use one of the two network connections
at a time. In this manner, no single component in the first up routing stage of the network
is critical since a given processor can always originate a connection through the network
through either of the two routing components to which it is connected.


Size

The entire UT_64x2 ends up being one-quarter the size of a UT_64x8 unit tree. It
is thus composed of a total of 52 RN1 routing components. The stages are each composed
of one-fourth as many routing components as the corresponding stages in a UT_64x8 unit
tree, as summarized in Table 3.4. The largest routing layer has 16 components. These can be
arranged as 4 x 4 components in the horizontal plane. This makes each side of the UT_64x2
roughly half the size of each side of the UT_64x8 unit tree. Thus each side of the UT_64x2 unit
tree will be about one foot long. The number of stages is the same between the UT_64x2
and UT_64x8 unit trees, so they will be the same height.


Characteristics

This composition gives the UT_64x2 unit tree the characteristics shown in Table 3.5.

Level    Routing Components
  0             16
  1             16
  2             12
  3              8

Table 3.4: UT_64x2 Component Summary



                      Total        Total Logical
                      Bandwidth    Channels         Capacity
Up from Leaves          128            64               2
Down toward Leaves      128            64               2
Up toward Root           32             1              32
Down from Root           32             1              32

Table 3.5: UT_64x2 Bandwidth


Through Bandwidth

As with the UT_64x8 unit tree, all vertical routing vias between routing levels in the stack
structure are contained in the routing components. Since this is only a scaled down version
of the UT_64x8 unit tree, we have no new problems with vertical through routing bandwidth.

Building Trees

Building fat-trees with UT_64x2 unit trees is done in much the same manner as before.
This unit tree can be used as the only kind of unit tree in constructing the fat-tree.
Alternately, it can be used as the leaf node in place of the bidelta clusters, and the UT_64x8 unit
tree can be used to build the rest of the structure for the tree. Using UT_64x8 unit trees
for the upper tree structure gives better routing performance and is hence preferable. This
difference in routing performance arises from the usage of the alternate RN1 configuration
in the final down routing stage of the UT_64x2 unit tree. This makes the UT_64x2's routing
performance slightly less desirable than that of the UT_64x8 unit tree, where the last stage
is configured normally.

Since there is a performance differential, it is reasonable to only consider using the UT_64x2
unit trees at the bottommost layer of the fat-tree. The fat-trees so constructed are free to
be any power of 64.

    N_{total} = 64^{i}                                                           (3.8)

One UT_64x2 unit tree is needed for each 64 processors. Thus the total number needed,
N_funit, is given by Equation 3.9.

    N_{funit} = \frac{N_{total}}{64}                                             (3.9)

The number of UT_64x8 unit trees in the next level is determined by matching the total
bandwidth out of the tops of the UT_64x2 unit tree layer with the bandwidth into each
UT_64x8 unit tree. As before, successive unit tree levels each require one-quarter the number
of the unit trees as the preceding level. As such, Equation 3.10 gives the number of UT_64x8
unit trees needed, N_unit.

    [Equation 3.10 is not legible in the source.]

Example  Using two levels of UT_64x8 style unit trees on top of one level of UT_64x2 style
unit trees, the fat-tree network will support 256K (64^3) processors. This requires unit
trees as follows:
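
The counts for this example are not legible in the source, so the following is only a sketch derived from the bandwidth-matching rule stated above: each UT_64x2 sends 32 channels out of its top (Table 3.5) and each UT_64x8 accepts 512 channels at its bottom (Table 3.2). The function name `mixed_network` and the resulting figures are assumptions of this sketch, not values quoted from the report.

```python
UT_64X2_TOP_CHANNELS = 32       # channels out of the top of one UT_64x2
UT_64X8_BOTTOM_CHANNELS = 512   # channels into the bottom of one UT_64x8


def mixed_network(i):
    """Return (N_total, UT_64x2 count, UT_64x8 count) for i unit tree layers.

    The bottom layer uses UT_64x2 stacks; the i - 1 layers above use UT_64x8.
    """
    if i < 1:
        raise ValueError("at least one unit tree layer is required")
    n_total = 64 ** i                                   # Equation 3.8
    n_small = n_total // 64                             # Equation 3.9
    # Match bandwidth out of the UT_64x2 layer to the first UT_64x8 layer.
    second_layer = n_small * UT_64X2_TOP_CHANNELS // UT_64X8_BOTTOM_CHANNELS
    # Each further layer needs one-quarter as many unit trees.
    n_large = sum(second_layer // 4 ** (j - 1) for j in range(1, i))
    return n_total, n_small, n_large


print(mixed_network(3))   # one UT_64x2 layer under two UT_64x8 layers
```

Under these assumptions the sum has the closed form (N_total / 768)(1 - 4^(1-i)), which parallels Equation 3.3.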



3.3.5 Wiring Details for Unit Trees

As described in Sections 1.3.2 and 3.2.2, in order to specify more than 8 bits of routing
information, the first routing byte must be periodically stripped from the data stream.
Within the unit tree structure, this operation should logically be performed at three places.

1. every fourth unit tree up the up routing path

2. at the lateral crossovers between the up and down routing trees

3. upon entrance of a unit tree at the top down routing level when the connection crosses
over higher up in the fat-tree.

The first case will only be necessary when more than 8 bits are needed to specify up routing.
This becomes necessary only when we have more than N_leaf · 64^4 processors, so it will not be
likely to be necessary in the near future. The need for lateral crossovers is described in
Section 3.2.2. Recall that the wiring constraints of Section 2.3 recommend that inputs from
different logical directions be distributed across as many different routing components as
possible. As such, lateral crossover inputs will only make up a fraction of the inputs to a
given routing component and hence be configured differently than the remaining inputs;
fortunately, RN1 allows the swallow property to be configured independently for each
input. The logical place for stripping bytes in the downward path is at the entrance of
each unit tree from higher in the fat-tree. This provides the most regular place for this
function. This is somewhat non-optimal in that the swallow occurs between every 3 stages
of down routing, as opposed to every 4; this means that the routing byte is being changed
slightly more frequently than it could be in the best case. As such the low two bits of each
down routing byte are unused.

The rotation of bits between unit tree stages must be carefully arranged so that all
paths through the fat-tree network rotate the routing bits equivalently, as described in
Section 3.2.3. Obviously, each byte-wide routing path must be rotated by 2 bits following
each up routing stage and following each down routing stage. Since there are only three
down routing stages through each unit tree, following the last down routing stage there
must actually be a four bit rotation so that the bits are correctly aligned to enter the
following unit tree. The lateral crossover connections pose the least straightforward bit
rotations. Each lateral crossover must guarantee that as the connection enters the down
routing path, the bit rotation is consistent with the point at which the down routing path is
entered. While this is easy enough to guarantee, this implies that the amount of rotation in
each crossover will differ depending on the height of the unit tree in the fat-tree; this results
from the difference in the length of the up routing path traversed before encountering the
crossover. This requirement, unfortunately, forces unit trees to differ slightly depending
upon which layer of the fat-tree they are implementing. The effect of this difference can be
minimized by locating the necessary rotation difference entirely in the horizontal pc-routing
board just above the up routing stage. This localizes the difference in unit trees to a single
horizontal routing board. Of course, since the four stages of two bit rotations return the
rotation to the original rotation, at most four such distinct boards will be necessary to
build an arbitrarily large fat-tree.
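
The rotation bookkeeping can be illustrated with a short sketch. The rotation direction (left) and the model in which an up path of height h has accumulated 2h bits of rotation are assumptions made for illustration; the report specifies only the rotation amounts. All function names are hypothetical.

```python
def rotate(byte, bits):
    """Rotate an 8-bit routing byte left by `bits` (direction is an assumption)."""
    bits %= 8
    return ((byte << bits) | (byte >> (8 - bits))) & 0xFF


def down_through_unit_tree(byte):
    """Apply the rotations after the three down stages: 2, 2, then 4 bits."""
    for amount in (2, 2, 4):
        byte = rotate(byte, amount)
    return byte


def crossover_variant(height):
    """Accumulated up-path rotation (in bits, modulo 8) at a given tree height."""
    return (2 * height) % 8


# A full pass down one unit tree rotates by 8 bits, so the byte is realigned.
assert all(down_through_unit_tree(b) == b for b in range(256))

# Only four distinct crossover rotations ever occur, however tall the tree.
print(sorted({crossover_variant(h) for h in range(1, 100)}))
```

The set of crossover rotations repeats with period four, which matches the claim that at most four distinct boards are needed.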


3.3.6 Wire Accounting

There are a large number of wires entering and leaving each unit tree stack in order
to properly connect the unit tree in the network. These wires, by necessity, must go to
a number of different locations. While each data path of 9 wires could go to a different
destination, actually allowing them to do so would unnecessarily complicate the wiring
pattern. To prevent this, we would like to group together reasonably sized wire bundles to
interconnect unit trees and connect unit trees to leaf nodes.


UT_64x8

The bandwidth entering and leaving the bottom of the UT_64x8 unit tree stack is logically
segregated into 64 directions, each of which is 8 channels wide. Each channel is composed
of 9 bits. There are equally many channels going both into the stack from below and
downward out of the stack. Grouping this together, we have 2 x 8 x 9 = 144 wires in each
logical direction. It makes sense to use this as the standard wire bundle size.

The bandwidth leaving the top of one of these unit trees composes a single logical
direction. However, since it will in turn be connecting to other UT_64x8 unit trees, it makes
most sense to use the same wire bundle size. Since there is only one-fourth the bandwidth
exiting the top of the stack, there will only be 16 such bundles out of the top of a unit tree.

Unit Tree Type         Number of Bundles    Bundle Size    Total Wires
UT_64x8    Bottom             64                144            9216
           Top                16                144            2304
UT_64x8c   Bottom             64                144            9216
           Top                12                144            1728
UT_64x2    Bottom             64                 36            2304
           Top                 4                144             576

Table 3.6: Unit Tree Wire Bundling for External Connections


UT_64x8c

Since the capped unit tree, UT_64x8c, is simply a variation on the UT_64x8, it is very
similar. The bandwidth into the bottom is identical to that of the UT_64x8. The wires out
of the top are different. In this case, each unit tree will potentially be connecting to 3 or
more others. For consistency, bundles of 144 wires can be used here as well. There are
three logical directions out of the top of the UT_64x8c. Each direction has one-fourth of
the total bandwidth out of the top of the UT_64x8. This gives us 4 bundles of 144 wires for
each of the 3 logical directions.

UT_64x2

The bandwidth into the bottom of the UT_64x2 unit tree is distributed as 64 pairs of
channels. This means there are 2 x 2 x 9 = 36 wires in each logical direction. Again, the
bandwidth out of the top of the unit tree is one logical unit. This top bandwidth needs to
be divided into bundles the same size as those out of the bottom of the UT_64x8 unit trees
to which this will be connecting. This means the top bandwidth should be broken down
into 4 bundles of 144 wires each.

Unit Tree External Bundle Summary

Table 3.6 summarizes the bundles in and out of the unit trees discussed in this section.
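
The totals in Table 3.6 follow from the bundle arithmetic described in this section. A minimal sketch, with illustrative names (`bundle_size`, `BUNDLES`) that do not appear in the report:

```python
WIRES_PER_CHANNEL = 9   # each channel is 9 bits wide


def bundle_size(channels_per_direction):
    """Wires in one bundle: channels in both directions times 9 wires each."""
    return 2 * channels_per_direction * WIRES_PER_CHANNEL


# (unit tree type, face) -> (number of bundles, wires per bundle)
BUNDLES = {
    ("UT_64x8", "bottom"): (64, bundle_size(8)),
    ("UT_64x8", "top"): (16, bundle_size(8)),
    ("UT_64x8c", "bottom"): (64, bundle_size(8)),
    ("UT_64x8c", "top"): (3 * 4, bundle_size(8)),   # 3 directions, 4 bundles each
    ("UT_64x2", "bottom"): (64, bundle_size(2)),
    ("UT_64x2", "top"): (4, bundle_size(8)),        # sized to match UT_64x8 bottoms
}


def total_wires(kind, face):
    count, size = BUNDLES[(kind, face)]
    return count * size


for key in BUNDLES:
    print(key, BUNDLES[key], total_wires(*key))
```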

3.4 Geometry

3.4.1 Basic Properties

From the preceding section, we see that the total number of unit trees needed to
construct each level decreases by a factor of four on each successive level toward the tree root.
Looking at the composition of each unit tree, we see a common convergence from 64 units
of a given size at one stage to 16 such units at the next physical stage up the tree. The size
of the units involved in the convergence increases from a single stack at the first stage by
a factor of 16 for each successive stage.

Example  Looking back at the 768K processor example of Section 3.3.3, the terminal level
is composed of 4096 bidelta stacks, the bottom tree level of 1024 UT_64x8 stacks, and the top
tree level of 256 UT_64x8 stacks. The convergence at the first level is from 64 bidelta stacks to
a set of 16 UT_64x8 stacks. At the next level, these sets of 16 UT_64x8 stacks form the basic
logical unit. Sixty-four of these units, which is 64 · 16 = 1024 UT_64x8 stacks, converge to
16 of these sets of 16 UT_64x8 stacks at the root level. These 16 · 16 = 256 UT_64x8 stacks
then form the root structure for the fat-tree.

3.4.2 Growth

There is no recursively repeatable, three-dimensional structure that keeps inter-stage
interconnection distances constant. This can easily be seen by noting that:

• The number of components needing interconnect grows exponentially.

• Keeping inter-stage distances constant, the three-dimensional space of candidate
locations for components is bounded by cubic growth.

Thus, there is no way to prevent inter-stage delays from growing between successive levels
as the system is scaled up in size. At best, we can hope to keep the inter-stage delay growth
down to a reasonable level.

Note, however, that since the number of processors grows by a factor of 64 with each added
level, the system grows very rapidly. As such, we need only accommodate a few stages of
growth in order to build the networks of interesting size for the near future.

3.4.3 Hollow Cubes

A natural approach to accommodating this 4:1 convergence in a world limited to three
dimensions is to build hollow cubes. If we select one side as the “top” of the cube, the
four sides can accommodate four times the surface area of the top, and naturally four times
the number of routing stacks of a given size. As such, the “sides” contain the converging
stacks from one level, and the “top” contains the set of stacks at the next level up the tree
to which the “side” stacks are converging. The remaining side will remain open, free of
stacks. This last side could be used to shorten wires slightly farther, but utilizing it in this
manner would decrease accessibility (see Section 3.4.5 for further discussion of this issue).

To start, let us consider the first level of convergence of the fat-tree structure. We want
to connect 64 stacks of a given size to some number of others. Particularly, we wish to
connect 64 leaf nodes to N_leaf / 12 UT_64x8's, or 64 UT_64x2's to 4 UT_64x8's. Each side is made
of 16 stacks of the leaf size. Each side is constructed by laying out the 16 constituent stacks
into a 4 x 4 array. The top will be of comparable size depending on the relative size of
the stacks at each fat-tree level. Figure 3.1 shows a hollow cube configuration in which
the stacks at each stage are of equivalent size (e.g. this will be the case when the lower
level is bidelta leaf clusters and the top UT_64x8 unit trees). Figure 3.2 shows a hollow cube
configuration where the top level stacks are four times the size of the lower level (e.g. this
would be the case on the first level of a full fat-tree network in which the top level was
constructed from UT_64x8 unit trees and the bottommost stage was composed of UT_64x2 unit
trees).

Figure 3.1: First Level Hollow Cube Geometry

Figure 3.2: Hollow Cube with Top and Side Stacks of Different Sizes

At the next stage of convergence, we treat the unit we just created as the base unit.
These can be arranged such that the tops of these hollow cube units are treated as the
basic unit structure for this next stage. Then we arrange a plane of 4 x 4 of these for each
side. A plane of unit trees of size equal to each of the sides is then used for the “top”. This
next stage is shown in Figure 3.3 as the natural extension of Figure 3.1. This progression
can be continued in this manner ad infinitum.

Figure 3.3: Second Level Hollow Cube Geometry


3.4.4 Convergence Size

Beyond the first level, the size progression is fairly straightforward. The side size will
grow by 4 at each successive stage since the size growth of 16 is accommodated in two
dimensions.

At the first level, the size of convergence depends on the size of the leaf stacks used
compared to the UT_64x8's to which they connect. Less will be required the smaller
the leaf stack size. In general 64 leaf stacks connect to N_leaf / 12 UT_64x8 stacks. From this we
can determine the side size of the initial convergence square, or “top”, and hence the side
length for the hollow cube. The side size for the cube will be the greater of the side size of
the “top” and the side size of the “sides”. Let l_UT be the length of the side of a UT_64x8 stack
and l_leaf be the side length for a leaf stack; these lengths will be as given by Equations 3.11
and 3.12, respectively. The length of a cube side is then as given in Equation 3.13.



(3.11) 

(3.12) 

(3.13) 


As described in Section 3.1, N_leaf will always be a multiple of three. For almost all sizes
of interest N_leaf will also be a multiple of four. As such, the component arrangements
will always be a square number and the ceiling functions can be dropped. This reduces
the two arguments of the max function to the same expression, so Equation 3.13 reduces to
Equation 3.14.



(3.14) 


The fact that the two lengths reduced to essentially the same expression was predictable
since the bandwidth out of the top of a stack is one-quarter the bandwidth into its bottom.
The four side bandwidths, which are all from the top of stacks, match the bandwidth of the
bottom of the stacks on the “top”. The surface area is proportional to bandwidth in this
configuration.


3.4.5 Features

While this hollow cube structure may not be the most compact structure, it does exhibit
a number of nice properties.

It exposes the entire surfaces which are interconnected to one another. Since the bandwidth
of the interconnection is largely surface area limited, this allows maximum exposure
of the areas that need to be connected.

Since the structure is “hollow”, the connections can be wired through the free-space in
the center. Wiring through free-space in this manner, the maximum wire length is only √3
times the length of a side. This maximum length increases by roughly a factor of 4 every
time we increase the number of processors by a factor of 64. This factor of four increase in
size is due to the factor of four growth in side size for each successive level of the hollow
cube.¹

1 Sometimes it will also be necessary to take the size of the “sides” of each square into account,
making the increase somewhat more than a factor of four; for the sizes of present interest, this is not a real
issue.

This structure is highly replicable. The progression described in Section 3.4.3 can be
continued arbitrarily. The growth in wire lengths may, however, prove to be undesirable
for very large structures.

In this form, the individual stacks are reasonably accessible for repair. Since the cubes
are hollow, it is possible to get at any individual stack without moving any other stacks.
This should allow repair and inspection without interfering with the bulk of the network
operation. There may be some difficulty with accessibility due to the mass of wire inside
the cube, but perhaps this can be minimized (see Section 3.5.3). The “missing” sixth wall
of the cube can be used to allow entrance into the center of the structure for maintenance
and repair.

3.4.6 Optimality

This solution seems practical while giving reasonable performance for the sizes of interest
for the next decade. It is not known to be optimal for minimizing the inter-stage wiring
distances. Finding an optimal solution that retains adequate accessibility to be of practical
interest is still an open issue.

3.5 Hollow Cube Construction
3.5.1 Structure Size 

Recall from Sections 3.3.1 and 3.3.4 that the larger unit tree stack is roughly two feet square
and one and a half inches tall, while the smaller is roughly one foot square. Arranging these,
or similarly sized bidelta leaf clusters, in 4 × 4 grids and building cubes as just described
in Section 3.4 creates structures with sides between 4 and 8 feet long. Following this
progression one step further, we see that the side lengths grow to between 16 and 32 feet.

Clearly, networks of this size, in current technology, are room and building sized entities,
not something to put on your desktop.
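
The progression of side lengths can be tabulated with a short sketch. It assumes the figures quoted above (stack sides of one to two feet, 4 × 4 grids, a factor of four growth per level) together with the cube-diagonal bound on free-space wire length from Section 3.4.5. The function names and arguments are illustrative only and do not come from the report.

```python
import math


def cube_side_feet(stack_side_feet: float, level: int, grid: int = 4) -> float:
    """Side length of the hollow cube at a given convergence level (1-based).

    Assumes each level tiles the previous structure in a grid x grid
    arrangement, so the side grows by a factor of `grid` per level.
    """
    if level < 1:
        raise ValueError("level must be at least 1")
    return stack_side_feet * grid ** level


def max_wire_feet(side_feet: float) -> float:
    """Longest free-space wire: the cube diagonal, sqrt(3) times the side."""
    return math.sqrt(3) * side_feet


if __name__ == "__main__":
    for level in (1, 2):
        low = cube_side_feet(1.0, level)
        high = cube_side_feet(2.0, level)
        print(f"level {level}: sides {low:.0f} to {high:.0f} ft, "
              f"longest wire {max_wire_feet(low):.1f} to {max_wire_feet(high):.1f} ft")
```

Under these assumptions the first two levels reproduce the 4 to 8 foot and 16 to 32 foot side lengths quoted in the text.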

3.5.2 Structure

To achieve the hollow cube structure of Section 3.4, it is necessary to construct “rooms”
for networking. The “walls” of these “rooms” will be tiled with stacks in the 4 × 4
arrangement described, as will be the “ceiling”. These stacks which tile the walls and ceiling will
be arranged such that the tops of the wall stacks face into the room, and the bottoms of
the ceiling stacks face into the room. The wall and ceiling thickness will be relatively small
since the stacks are thin. Current projections are for the stacks to be about 1.5 inches thick. In
some cases, it will be appropriate for the “ceiling” structure to be something other than
the top or actual ceiling of the room; a review of Figure 3.3 will make this point clear. This
should pose no additional problems.

To hold these stacks in place, the wall will be a structural grid. It will be similar to a
raised floor where the stacks are analogous to tiles and the wall structure is analogous to
the floor grid which supports the tiles. This grid provides sites for each of the stacks. It
will then be possible to slide the stacks in and out of their sites, and “lock” them into place.
Due to the size and nature of this structure, the wall will also have to provide structural
support. Adding an additional “real” wall to support the structure would prevent close
packing and require the structure to be much larger.

The grid structure supporting the stacks will also need conduits to allow the fluorinert
which cools the stacks to flow to the stacks. Similarly, conduits for power supply connections
will also be needed. The actual fluorinert pumps and power supplies can be placed in
adjacent rooms or on the sixth side of the cube.

3.5.3 Wiring

Interconnection signal wiring will be routed through the free-space in the center of the
room.

Wire Frames

Instead of connecting the wire bundles directly to each stack, these will be wired to a
wiring frame. This wiring frame is a unit attached to the structural grid of the wall and
will serve as a hatch-door on each routing side of a stack. All the wires to a stack are
connected to the wiring frame. When the wiring frame is closed and locked into position,
it will be compressed directly against the routing stack. The compression makes electrical
contact between the stack and the connecting wires, establishing signal flow. The wires are
wired to the wiring frames instead of the stack to facilitate repair and ease of access to
each stack. This scheme allows a stack to be changed without the need to disconnect and
reconnect wires from 80 or more different sources. With the wiring frame, the frame can
simply be opened to replace a stack.

The wire frame is “locked” into place when closed. This allows fluorinert and power to
flow into the stack. When a frame is “unlocked” to remove a stack, several things should
happen. The power to the stack should be cut. The fluorinert flow to the stack should also
cease, and the fluorinert in the stack should be drained. Additionally, it may be necessary
to terminate the transmission lines in some well-behaved manner for the sake of the rest of
the network. When the frame is “re-locked”, the fluorinert flow will need to be resumed.

Once the system has had time to refill with fluorinert and get adequate power, it should
go through a reset sequence so that it comes online in a consistent manner.

Optical Connections

While the hollow cube provides adequate space for the wiring required, the wiring within
the cube will still be quite formidable. Keeping track of the large quantity of wires will be
a serious task, especially when problems occur with wiring conductivity.

An alternative that may be technically viable by the time one of these networks is
actually constructed would be to use optical interconnections to provide the wiring through
the cube. The cube is basically free-space so that each light beam need only be aimed at
its appropriate receiver in order to effect interconnect. Since light beams will not interfere
with each other, there will be no need to worry about the three-dimensional wiring problem
of avoiding interference in the center of the cube. [Bergman 86] and [W87] discuss early
work to provide large scale optical interconnect for VLSI systems. Of particular interest is
their use of a holographic optical element to direct optical beams for interconnections and
the potential for adaptive and dynamic connection reconfiguration.

With optical interconnect, signal propagation time across the connection is less sensitive
to wiring materials. Long electrical connections will have a delay that depends on both the wire
length and the permittivity of the material, ε, as given by Equation 3.15, where c is the
speed of light.


$$ t_{wire} = \frac{l_{wire}\,\sqrt{\epsilon}}{c} \qquad (3.15) $$


Long optical interconnection, in contrast, comes closer to the fundamental limit posed by
the speed of light as given by Equation 3.16 [Kiamilev 89].


$$ t_{optical\ interconnect} = \frac{l}{c} \qquad (3.16) $$
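
To get a feel for the difference, the two delays can be compared numerically. This sketch assumes the standard transmission-line relation in which a signal travels at c/√ε_r through a dielectric of relative permittivity ε_r, and treats the optical path as free-space propagation at c. The permittivity value and the wire length used below are hypothetical and chosen only for illustration.

```python
import math

C_METERS_PER_NS = 0.299792458   # speed of light in vacuum
METERS_PER_FOOT = 0.3048
C_FEET_PER_NS = C_METERS_PER_NS / METERS_PER_FOOT


def wire_delay_ns(length_feet: float, rel_permittivity: float) -> float:
    """Propagation delay of an electrical wire, assuming velocity c / sqrt(eps_r)."""
    return length_feet * math.sqrt(rel_permittivity) / C_FEET_PER_NS


def optical_delay_ns(length_feet: float) -> float:
    """Propagation delay of a free-space optical path, assuming velocity c."""
    return length_feet / C_FEET_PER_NS


if __name__ == "__main__":
    length = 32.0   # hypothetical wire length in feet
    eps_r = 4.0     # hypothetical relative permittivity of the wire dielectric
    print(f"electrical: {wire_delay_ns(length, eps_r):.1f} ns")
    print(f"optical:    {optical_delay_ns(length):.1f} ns")
```

With a relative permittivity of 4 the electrical delay is exactly twice the free-space delay, which shows how the dielectric alone can add whole clock cycles on long runs.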


Without wires occupying volume in the center of the cube, the whole interior of the
cube would be much more accessible for repair and maintenance.

One potential problem that would arise with optical interconnect is keeping the millions
of laser connections in proper alignment. This could be a very hard task, and alignment
might prove very sensitive to repair operations within the cube. Adaptive alignment
schemes would virtually be a necessity to make the fine alignment of a system of
this size tractable. Adaptive alignment would allow the system to adjust itself into
proper configuration. Perhaps a mature version of the programmable optical interconnect
of [Kiamilev 89] would provide a potential candidate for such adaptive alignment.

If utilized, the optical interconnect would, of course, be integrated as part of the wiring
harness.


3.5.4 Maintenance

One very important issue for the hollow cube is that maintenance is possible, and that
it is possible while the machine is in operation. With a system of this size, and necessarily
expense, extensive down time will be expensive. Additionally, with a system this large, the
number of failures per unit time is necessarily proportionally larger than in small systems.
As a first line of defense against these problems the system is designed to be fault tolerant.
When faults do occur, however, it will be necessary to fix them before they accumulate. It
is very desirable to be able to perform such repairs without completely disabling the machine.

The “unused” sixth side of the cube provides access into the interior of the cube. On
this face a door or hatch can be placed to allow access into the internal structure. With the
individual stacks laid out contiguously along the walls and ceiling, each is readily accessible
without any need to displace any other stacks.

A single stack can be removed and replaced relatively quickly. With the wiring attaching
to the wire frame, the whole operation of replacing a faulty stack can be moderately short.
The faulty stack needs first to be located. Once located, its wiring-frame can be “unlocked”.
The stack can then be replaced and the wiring-frame “relocked”, allowing the network to
return to normal operation. The stack which has just been removed can then be taken
elsewhere so that its faults can be identified and repaired.

Since there are redundant paths through the network using different intervening stacks
for routing, it will be possible to remove an entire stack from the network while the rest
of the network remains in operation. The practical effect of this will be that some fraction
of the bandwidth will be “missing” or rather appear faulty; this is how it would look
anyway if much of the routing stack were faulty. The network should be wired so as to
guarantee that each output from one level of the network can route through several different
stacks at the next level. With this property and proper termination of the connections to
the removed stack, the network will simply fail to route any connections which attempt
to traverse the stack being replaced. The originating processor will then retry the failed
connections later and either utilize another path through the network, or the same one after
the replacement stack is powered up.

3.5.5 Technology Scaling

These considerations, and the sizes assumed throughout this work, are based mostly on
current, usable technology. Certainly, as interconnect technology improves and packaging
sizes diminish further, this structure can similarly decrease in size. This decrease in size will
linearly translate directly to improvements in the interconnect speed between components
and stages.

One technology that will perhaps be feasible by the time systems of this size become of
real practical interest is the kind of packaging used on the Cray III [Cray 89]. Utilizing this
kind of technology would allow the size of the component structures, and thus the resulting
structure, to be diminished by about a factor of four to six. Similarly, prospects of wafer
scale integration offer roughly the same size scaling advantages.

3.6 Long Wires

It should be clear from Section 3.5.1 that long wires will be required for interconnection
between unit tree stacks. Here long wires are any wires whose length must be longer than
the longest wire within a unit tree stack.

3.6.1 Strategy 

The clock cycle on the unit tree and bidelta stacks will be optimized to be as short as
possible given the operational speed of the routing component and the length of the longest
wire in the unit tree. This means it will necessarily take multiple clock cycles for data to
traverse these long wires between unit tree stacks. It is possible to place multiple data bits
on a set of wires simultaneously, but we must be careful that the data is kept in proper
phase with the clock in order to assure proper behaviour of the routing chip. The proper
phase can be assured in either of a couple of ways. The interconnection wire lengths can
be carefully chosen such that their delays are always integer multiples of the length of the unit
tree clock cycle. In this manner, the phase is preserved by guaranteeing the delay through
the wire interconnection is sufficiently well behaved. Alternately, tapped delay line buffers
can be used to ensure that the data is presented in the proper phase relation with the
unit tree clock. A scheme similar to [Rettberg 87] can be used for this purpose. If optical
interconnect is used (Section 3.5.3), using wires to match interconnect delays to the phase
of the data will not be possible. In such a case it would be necessary to use the tapped
delay line alternative.

This scheme will necessarily increase the latency of a connection, but the increased
latency is inevitable given the geometric constraints of routing. This scheme does manage
to minimize the effects of scaling on latency by only slowing down those interconnection
stages which must be long.
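
The phase requirement amounts to rounding each long-wire delay up to a whole number of clock cycles. The sketch below illustrates that arithmetic only. The function names are hypothetical, and the clock period and wire delay in the usage lines are made-up values rather than figures from this report.

```python
import math


def delay_cycles(wire_delay_ns: float, clock_ns: float) -> int:
    """Whole clock cycles a long wire must occupy so data stays in phase."""
    if wire_delay_ns < 0 or clock_ns <= 0:
        raise ValueError("delay must be non-negative and clock period positive")
    return math.ceil(wire_delay_ns / clock_ns)


def padding_ns(wire_delay_ns: float, clock_ns: float) -> float:
    """Extra delay needed to reach the next integer multiple of the clock period.

    The padding could come from added wire length or from a tapped delay line
    setting; the arithmetic is the same in either case.
    """
    return delay_cycles(wire_delay_ns, clock_ns) * clock_ns - wire_delay_ns


if __name__ == "__main__":
    clock = 10.0   # hypothetical clock period in ns
    delay = 23.0   # hypothetical raw wire delay in ns
    print(delay_cycles(delay, clock), "cycles,", padding_ns(delay, clock), "ns of padding")
```

Only the stages whose raw delay exceeds one clock period pay any extra cycles, which is the sense in which the scheme slows down only the stages that must be long.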

3.6.2 Requirements

The only additional requirement this scheme poses is on RN1, the routing component.

In particular, it only requires different behaviour from RN1 when a connection is turned
around (see Section 1.3). Currently, RN1 expects to get valid data in the new direction
of flow two clock cycles following the reception of a byte indicating it should turn the
connection around. The addition of a single clock cycle’s worth of wire delay causes this
turn around time to be increased by two clock cycles; that is, one additional clock cycle
is required for the turn byte to propagate across the interconnect to the next routing
component, and one additional clock cycle is required for the return data to propagate
back across the connection. With no extra clock cycles of wire delay between RN1 routing
components, this turn will occur in two clock cycles. With k clock cycles of wire delay
between a pair of successive routing stages, the turn will occur in 2(k + 1) clock cycles.
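
The turn-around relation is simple enough to state as a one-line function. This is only an illustration of the 2(k + 1) expression above; the function name is hypothetical.

```python
def turn_cycles(k: int) -> int:
    """Clock cycles for a connection turn across a wire with k cycles of delay.

    One pass for the turn byte and one pass for the returning data each cost
    (k + 1) cycles, giving 2(k + 1) in total.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    return 2 * (k + 1)


if __name__ == "__main__":
    for k in range(4):
        print(f"k = {k}: turn completes in {turn_cycles(k)} clock cycles")
```

Each added cycle of wire delay costs two cycles of turn time, matching the description of the turn byte going out and the data coming back.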

In order for the turn sequence to function properly, the routing component at each end of
such a long wire must be capable of dealing with the extra cycles. RN1 will need to know
the number of delay cycles to expect and be able to deal with them accordingly.

The delay size will need to be configurable for every input and output port on the
routing component. With 8 input ports and 8 output ports, this makes for a total of
16 ports that need to be configured on a single RN1 component. It is necessary to be
able to configure each input and output port separately, as established in Section 2.3, to
allow input and output ports from the same component to connect to interconnections of
differing delay lengths. Additionally, to allow for the large range of delay values that must
be accommodated for reasonably large networks,² the delay length configuration will require
multiple bits to specify the delay of a single input or output port. While this configuration
information could be provided to the chip by configuration pins, it should be obvious that
this would require the addition of quite a large number of signal pins. RN1 already has
tight constraints on the number of pins that its package will support. Thus, an alternate
means must be used to configure an RN1 routing component with this information.
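
A rough count shows why dedicated pins are unattractive. The sketch assumes each port's delay is stored as an unsigned binary number; that encoding, the function name, and the maximum delay used below are illustrative assumptions, and only the figure of 16 ports comes from the text.

```python
def config_bits(max_delay_cycles: int, ports: int = 16) -> tuple[int, int]:
    """Return (bits per port, total bits) for a binary-encoded delay setting.

    Assumes every port stores a value in the range 0..max_delay_cycles.
    """
    if max_delay_cycles < 0:
        raise ValueError("max_delay_cycles must be non-negative")
    bits_per_port = max(1, max_delay_cycles.bit_length())
    return bits_per_port, bits_per_port * ports


if __name__ == "__main__":
    per_port, total = config_bits(7)   # hypothetical maximum of 7 delay cycles
    print(f"{per_port} bits per port, {total} bits per component")
```

Even a modest delay range multiplies out to dozens of configuration bits per component, far more than could reasonably be brought out on package pins.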

One such alternative is to use UV programmable cells in the routing chip to store this
configuration data. This would require the addition of only a few signal pins to facilitate
the programming of the UV configuration cells. Cells would be programmed by initiating
a program sequence and shifting the configuration data into the UV cells while the
component is exposed to UV light. [Glasser 85] describes a technique for constructing UV
programmable cells of this nature which is applicable to the one micron CMOS process in
which RN1 is fabricated. This scheme has the drawback that a component cannot simply
be replaced with another one off the shelf. The replacement component will first need to
be configured before it can serve as a replacement.

2 See Section 4.1 for projected wire and delay cycle lengths in systems of various sizes.

A simple programming board can easily be built to program the configuration into a
component as necessary.

With the additional configuration information provided to RN1, it must also be updated
to deal appropriately with the configured delays of various lengths. This will require
updating the finite state machine logic in the routing component to deal with varying
delay lengths. This will be a minor change, but will necessitate adding additional state
information to the FSMs. During the additional delay cycles, some reasonable data or
pattern should be sent to the component not directly connected to the long wire so that
the system is guaranteed to be well behaved. After the status and checksum bytes are sent
to this component, the component on what was originally the sending side of the long wires
will need to send this additional data. This additional data could simply be a repetition of
the status and checksum bytes.

3.7 Processors

Processing elements can be attached in a straightforward fashion to this network configured
in the geometry described. Conceptually, each network terminal point will consist of a
processor, memory, and a cache-controller (see Figure 1.1). A given number of processors
will be associated with each leaf stack, whether a bidelta leaf cluster or a unit tree.

As previously noted, each leaf stack is rather thin. The processors and their associated
components for a given leaf can also be arranged in a stack structure. The sides of this
stack structure will be made the same size as the relevant routing stack. The processor
stack can then be layered until it accommodates all the necessary components. Since the
number of processors associated with a leaf routing stack is generally about the same as
the number of routing components in the leaf stack, the processor stack structure will be
of a thickness which is the same order of magnitude as that of the leaf routing stack. This
processor stack can then be abutted directly to the leaf stack or even directly connected,
making it part of the same physical stack. Since the stack structure tends to keep vertical
distances short, this allows the processors to be within reasonable proximity of the routing
network. Also, since the stacks are thin, this will accommodate space for the processors by
only making the “walls” of the hollow cube (Sections 3.4 and 3.5) slightly thicker. This
additional thickness should not be significant enough to change the overall size of the hollow
cube structure appreciably.

3.8 Routing Computation

The arrangement of components and structures described so far essentially determines
the composition of a routing sequence for this network. In this section, this routing is
summarized, including a scheme for the proper generation of the routing sequence.

3.8.1 Distinguishing a Processor

Each leaf processor needs to have a unique specification so that it can be referenced.

In a fat-tree structure, the logical specification for a processor is its “address” or location
relative to the root of the fat-tree. In the case of a full fat-tree, this will simply be the path
down the fat-tree to the processor. In a hybrid fat-tree with bidelta clusters as leaves of
the fat-tree network rather than individual processors, this address will be the path from
the root and the location of the processor in the leaf cluster. Thus, a processor is specified
as L = C ∘ P, where C is the path from the fat-tree root to the leaf, and P is the location
of the processor within the bidelta cluster. Recall from Section 3.1.1 that routing out of
the bidelta leaf cluster stack is accomplished when the high two bits of the initial routing
byte are 1’s; as such, the high two bits of a processor specification will never be 11. It is
easiest for routing if C is exactly the routing sequence necessary to get from the root to
the desired leaf. Recall from Section 3.3.5 that the low two bits of each byte in a down
routing sequence through unit trees are unused since only six bits of routing are performed on
the down routing path through each unit tree. The value of these unused bits is irrelevant,
but for clarity, they should probably be considered zero (0).

3.8.2 Routing

The routing sequence will be the series of bytes necessary to open a connection to a
desired destination. In general this consists of three parts: the up routing sequence, U, the
down routing sequence, D, and the routing within the bidelta cluster, R. Of course, full
fat-trees will not have the R component. Each of the components of the routing sequence
will occupy an integral number of bytes; a component will be padded to byte length when
necessary. This keeps the portions of the routing sequence conceptually separated and is
consistent with the previous specifications for the placement of swallows (Sections 3.3.5
and 3.1.2). The separation of the components of the routing sequence into their own bytes
allows the routing bytes to be generated without any need to shift bits around within bytes.
A complete routing sequence is simply U ∘ D ∘ R.

3.8.3 Computing the Routing Sequence

With this configuration, we can compute the necessary routing sequence with moderate
ease. Unlike the bidelta case, it is necessary to know the source location in the network
in order to determine an optimal routing sequence; this additional information is necessary
in order to exploit locality in the fat-tree structure. Let L1 = C1 ∘ P1 be the source processor
and L2 = C2 ∘ P2 be the destination processor.³ The computation of the routing sequence
then proceeds as follows:

1. I = C1 ⊕ C2; that is, I is the bit-wise logical exclusive-or of C1 and C2.

2. M = a bit vector with 0’s for all leading zero bytes of I and 1’s for all bytes from the
leading non-zero byte to the end; that is, M is a vector which marks all significant
bytes.

3. j = location of the highest-order non-zero bit in M (1 based).

4. O = 10 if bit 6 or 7 is the highest non-zero bit in the high non-zero byte of I
   O = 01 if bit 4 or 5 is the highest non-zero bit in the high non-zero byte of I
   O = 00 if bit 2 or 3 is the highest non-zero bit in the high non-zero byte of I

5. U = (11)^j ∘ (O) ∘ (00)^((3 − j) mod 4); the trailing 0’s simply suffice to pad U to an
integral byte quantity.

6. D = low j bytes of C2. This will, in fact, include unnecessary routing information.
However, this quantity needs to be padded to an integral number of bytes in any case.
When the bit-rotations are wired appropriately in the crossover routing paths, this
allows D to be generated with minimal calculations.

7. R = P2

Note that when M evaluates to all zero, it is the case that C1 = C2 and U and D are
unnecessary; in this case, R fully specifies the routing to the final destination since the two
processors are in the same leaf cluster.

3 This degenerates to the specific case of full fat-trees when P1 = P2 = ε (the empty string).

Figure 3.4: Byte-wide xor Slice.
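
The seven steps above can be sketched in Python. This is a minimal illustration, not the report's implementation. The function name `routing_sequence` and the use of `bytes` values (high byte first) are assumptions, as are the labels U, D, and R for the up, down, and bidelta portions of the sequence. The padding rule for the up portion (zero bit pairs up to the next byte boundary) is inferred from the statement that trailing 0's pad it to an integral number of bytes.

```python
def routing_sequence(c1: bytes, c2: bytes, p2: bytes) -> bytes:
    """Compute U . D . R for source path c1 and destination (c2, p2).

    c1 and c2 are root-to-leaf paths of equal length, high byte first;
    p2 is the destination's location within its leaf cluster (may be empty).
    """
    if len(c1) != len(c2):
        raise ValueError("source and destination paths must be the same length")

    # Step 1: I = C1 xor C2, byte by byte.
    i_bytes = bytes(a ^ b for a, b in zip(c1, c2))

    # Steps 2 and 3: j is the number of significant bytes of I, i.e. the
    # 1-based position (from the low end) of the leading non-zero byte.
    lead = next((k for k, b in enumerate(i_bytes) if b), None)
    if lead is None:
        return bytes(p2)            # same leaf cluster: R alone suffices
    j = len(i_bytes) - lead
    high = i_bytes[lead]

    # Step 4: O depends on which bit pair holds the highest non-zero bit.
    if high & 0xC0:
        o = 0b10
    elif high & 0x30:
        o = 0b01
    else:
        o = 0b00                    # bits 2-3 (bits 0-1 are assumed unused)

    # Step 5: U = (11)^j . O, padded with 00 pairs to a whole number of bytes.
    pairs = [0b11] * j + [o]
    pairs += [0b00] * (-len(pairs) % 4)
    u = bytearray()
    for k in range(0, len(pairs), 4):
        byte = 0
        for pair in pairs[k:k + 4]:
            byte = (byte << 2) | pair
        u.append(byte)

    # Step 6: D = low j bytes of C2.  Step 7: R = P2.
    d = c2[len(c2) - j:]
    return bytes(u) + d + bytes(p2)


if __name__ == "__main__":
    seq = routing_sequence(bytes.fromhex("cc02"), bytes.fromhex("e642"), bytes.fromhex("67"))
    print(seq.hex())
```

The values in the usage lines are those of the worked example in Section 3.8.5, so the result can be compared against the sequence derived there.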

3.8.4 Implementation of Computation

It is enlightening to consider possible structures to efficiently implement the computation
just described. In order to talk about individual bytes and bits, let us adopt the
following notation:

• B<n> refers to the nth bit of B, with the lowest bit being bit zero.

• B_N refers to the Nth byte of B, with the lowest byte being byte zero.

• B_N<n> thus refers to the nth bit of the Nth byte of B.

The intermediate quantity I, being an xor, is designed to be cheap to calculate.

The value M requires only the knowledge of which byte is the first to contain non-zero
bits. This allows all the bits in each byte of I to be or’ed together for comparison.
The result is a small bit vector. Since this quantity is then a small number of bits, it is
possible to let the first non-zero bit stimulate the others so that a vector results with all
the significant bytes marked with a 1 bit. Figure 3.5 shows this computation for a single
bit in M.

Figure 3.5: Calculation of a Single Bit of M

Figure 3.6: Calculation of a Single Bit of O


Both the computation of I and that of M can easily be done either completely in parallel,
on all of C1 and C2 at once, or in a byte serial manner as is appropriate to a particular
implementation.

In both cases the computation of O can be calculated on each byte in I simultaneously
and in parallel with the computation of M. The value of M will determine which computation
of O should be used. The computation of O can proceed similarly to M, only with a
granularity of pairs of bits instead of bytes. In fact it is necessary to look only at the high
two pairs of bits in order to determine O. In this case, instead of stimulating the lower
bit(s) in the bit vector, we inhibit so that O_N<2:1> is the correct value for use in the
construction of U. See Figure 3.6 for an example of the calculation of O.

We can assume for the moment that U will be a single byte quantity.⁴ U can be
computed from M and O in a couple of gate delays as shown in Figure 3.7.
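
The hardware structure described here can be mimicked in software to check the logic. The sketch below assumes bytes are given high byte first and numbers them from zero at the low end. All function names are hypothetical, and the loops stand in for what the text describes as simple parallel gate structures.

```python
def significant_byte_mask(i_bytes: bytes) -> int:
    """M: bit n is set when byte n (0 = lowest) is at or below the leading non-zero byte."""
    # OR together the bits of each byte: 1 when the byte has any bit set.
    nonzero = [1 if b else 0 for b in reversed(i_bytes)]
    mask = 0
    seen = 0
    for n in range(len(nonzero) - 1, -1, -1):   # walk from the high byte down
        seen |= nonzero[n]                       # first non-zero byte marks all lower ones
        mask |= seen << n
    return mask


def pair_select(byte: int) -> int:
    """Candidate O for one byte, looking only at its high two bit pairs."""
    if byte & 0xC0:
        return 0b10
    if byte & 0x30:
        return 0b01
    return 0b00


def select_o(i_bytes: bytes):
    """Pick the O candidate of the highest significant byte, or None if I is zero."""
    mask = significant_byte_mask(i_bytes)
    if mask == 0:
        return None
    candidates = [pair_select(b) for b in reversed(i_bytes)]  # one per byte, independent
    return candidates[mask.bit_length() - 1]


if __name__ == "__main__":
    i_value = bytes.fromhex("2a40")
    print(bin(significant_byte_mask(i_value)), select_o(i_value))
```

Computing a candidate for every byte and then selecting with M mirrors the point that the two calculations can proceed in parallel.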

Since bytes are sent byte serial into the network, no computation is really done for D.
M identifies which byte of C2 to start with when it is time to start sending D into the
network. Similarly, no computation is done for R.

From these simple constructs, we can see that the computation of a routing sequence is
an inexpensive operation and can be done in a reasonably small amount of time. A quick
simulation of this implementation in a 1 micron CMOS process⁵ calculates U and M in
less than 6 ns.

4 This will be the case until more than about 50 million processors are connected in this manner; this
scheme easily generalizes to the multiple byte case.

5 The same process in which RN1 is fabricated.

Figure 3.7: Calculation of U from M and O


3.8.5 Example

Consider the fat-tree built from bidelta leaves and two levels of unit trees described
in Section 3.3.3. This gives a fat-tree with 786,432 processors; C will be 2 bytes long and
P will be one byte long. Let L1 = 0xCC0234 and L2 = 0xE64267.

1. I = 0xCC02 ⊕ 0xE642 = 0x2A40

2. M = 0x03

3. j = 2

4. O = 10 for the low byte of I and O = 01 for the high byte; since M = 0x03, the
high-byte value 01 is the correct value for O.

5. U = (11)^2 ∘ (01) ∘ (00) = 11110100 = 0xF4

6. D = 0xE642

7. R = 0x67

Thus the correct routing sequence is U ∘ D ∘ R = 0xF4_E64267. This will open a
connection out of the bidelta stack, through the first stage unit tree, and into level 2 of the
next unit tree. At that point, it has crossed over to the down path so the first byte, U, is
swallowed. It routes down through that stack with the relevant portion of the next byte.
That byte is then swallowed and the next byte, 0x42, is used to route back through the
appropriate unit tree in the first stage. This final D byte is swallowed and the 0x67 byte
is used to route through the bidelta leaf cluster to the final destination.

4. Analysis


Chapters 2 and 3 described the construction of Transit fat-tree networks. This chapter
quantifies some characteristics of the resulting networks. Considering fat-tree networks
with both unit tree leaves and bidelta cluster leaves, a progression of networks from full
bidelta networks to full fat-trees can be compared and analyzed. Section 4.1 describes the
distribution of wires by length in hollow cube configurations. Section 4.2 extends the bidelta
network structure beyond the single stack size for the purpose of comparisons. Section 4.3
summarizes network size freedoms for the various network configurations. It then uses size
as a parameter to quantify hardware requirements and network latencies. Section 4.4 uses
the development of the previous sections to provide some concrete network examples for
comparison. Finally, Section 4.5 uses simple probabilistic models to provide basic routing
statistics for the networks of Section 4.4.

4.1 Wire Lengths

Wire lengths are a significant concern in building large networks. As the system grows,
the number of clock cycles required to route between unit trees becomes the dominant
component of network latency. The hollow cube structure (Section 3.4) keeps wires moderately
short considering the magnitude of the wire convergence that must occur. This section
quantifies the distribution of wires by length in hollow cube geometries.

4.1.1 Worst Case

Utilizing free-space wiring in the hollow cube, the longest wires will be those that
traverse the cube's diagonal. These wires will be $\sqrt{3}$ times the length of the cube side
long. The length of the cube side depends on the size of convergence for the cube. As
seen in Section 3.4.4, for standard hybrid fat-trees, the sides will be four bidelta leaf stack
side lengths long at the first convergence and increase by a factor of four at each successive
stage. Full fat-trees similarly progress from sides four $UT_{64 \times 2}$ stack side lengths long
at the first convergence and progress by a factor of four at successive stages. The length
of the side of a stack is determined by the maximum number of routing components on a
single routing stage.

4.1.2 Distribution of Lengths

Only a small fraction of the total wires within a hollow cube are of the worst case length.
Perhaps a more interesting metric is the distribution of wires by their length.

For simplicity, let us normalize wire lengths to the length of the stack sides. This allows
the derived distributions to be applied generally regardless of stack size. The normalizing
length will be the length of a side of the component stacks at the terminal level of
the fat-tree. $N_s$ will denote the number of stacks along the side of the cube.

Each interconnection within the hollow cube connects from a "side" to the "top" of the
cube (see Figures 3.1 and 3.2). Connection endpoints are evenly distributed across the two
dimensions composing each of the sides of the hollow cube; this results since each stack in the
sides has the same distribution of interconnection to the top. Each of the stacks on the
hollow cube sides will connect to endpoints which are evenly distributed across the surface
of the top of the cube. Using this uniformity, we can easily characterize the wire length
distributions.

We can start by decomposing the distribution into separate dimensional components, x,
y, and z. Since these distributions are independent, we can then form the total distribution
from the product of the dimensional distributions. To express the dimensional distributions,
a couple of simple probability distributions are needed.


Uniform Unidimensional Distribution

One case of interest occurs when the distribution is completely uniform across the
possible space of lengths. For such a case, we have a uniform distribution. Thus the
distribution function is simply that of Equation 4.1.

$$p_x(x_0) = \begin{cases} \frac{1}{N_s} & 0 \le x_0 < N_s \\ 0 & \text{otherwise} \end{cases} \tag{4.1}$$


Uniform Difference Distribution

The other case of interest occurs when the distribution is that of the difference between
two values picked randomly from the same uniform distribution. In this case, we have a
distribution $p_x(x_0)$ as in Equation 4.1, and we want the distribution for the quantity $|x_1 - x_2|$
where $x_1$ and $x_2$ are described by $p_x$. This distribution is described by Equation 4.2.

$$p_{dx}(x_0) = \begin{cases} \frac{1}{N_s} & x_0 = 0 \\ \frac{2 (N_s - x_0)}{N_s^2} & 0 < x_0 < N_s \\ 0 & \text{otherwise} \end{cases} \tag{4.2}$$


Hollow Cube Distribution

With the simple distributions just described, the total distribution for wire lengths in
a hollow cube can be derived. The distance the wire must traverse in the vertical (z)
dimension will be described by $p_x$ since all interconnections connect from the top to some
distance down the side of the cube. Similarly, the distance the wire must traverse "into"
the cube, normal to the surface of the side, will be determined solely by the location of the
destination on the top surface. This dimension, which for the current development will be
called y [1], can also be described by $p_x$. In the remaining dimension, which will be referred
to as x, the distance traversed by the wire depends on the relative placement of both the
source and destination of the wire and thus will be described by $p_{dx}$.

[1] N.B. In an absolute frame of reference, this dimension would be the x dimension for two faces and the
y dimension for the two faces adjacent to those.

Figure 4.1: Distribution of Normalized Wire Lengths for $N_s$ = 4 and 16, Respectively.

Table 4.1: Expected Wire Lengths Normalized to Stack Size, $l_{stack}$

With this understanding of the dimensional distance distributions, we can describe the
wire length distribution by Equation 4.3.

$$p_l(l_0) = \sum_{z=0}^{\max(l)} \sum_{y=0}^{\max(l)} \sum_{dx=0}^{\max(l)} \left[ p_x(z) \cdot p_x(y) \cdot p_{dx}(dx) \cdot u_l(l_0, dx, y, z) \right] \tag{4.3}$$

$$\max(l) = \left\lceil \sqrt{3} N_s \right\rceil \tag{4.4}$$

The function $u_l$ is simply used to determine whether or not to include the probability for
a given (dx, y, z) combination in the sum and is described by Equation 4.5.

$$u_l(l_0, dx, y, z) = \begin{cases} 1 & l_0 = \left\lceil \sqrt{dx^2 + y^2 + z^2} \right\rceil \\ 0 & \text{otherwise} \end{cases} \tag{4.5}$$


The distribution, $p_l$, is easily computed for a given value of $N_s$ and gives the resulting
lengths in units of $l_{stack}$. Figure 4.1 shows the distribution for $N_s$ = 4 and 16, the
first two side lengths for hollow cubes. The expected, or average, wire lengths derived from
this distribution are shown in Table 4.1.
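
The hollow cube distribution of Equations 4.1 through 4.5 can be evaluated by direct enumeration. The following Python is an illustrative sketch only, not the program used to produce Figure 4.1 or Table 4.1. It assumes that coordinates run over 0 <= x_0 < N_s, that the bracket in Equation 4.5 is a ceiling, and that enumerating each coordinate up to N_s - 1 is sufficient because the probabilities vanish beyond that. All function names are hypothetical.

```python
from math import isqrt


def ceil_sqrt(n):
    """Smallest integer >= sqrt(n), computed exactly for integer n >= 0."""
    return 0 if n == 0 else isqrt(n - 1) + 1


def p_uniform(v, n_s):
    """Uniform distribution over 0 <= v < n_s (assumed form of Equation 4.1)."""
    return 1.0 / n_s if 0 <= v < n_s else 0.0


def p_difference(d, n_s):
    """Distribution of |x1 - x2| for two uniform picks (assumed form of Equation 4.2)."""
    if d == 0:
        return 1.0 / n_s
    if 0 < d < n_s:
        return 2.0 * (n_s - d) / n_s ** 2
    return 0.0


def hollow_cube_length_distribution(n_s):
    """Map each normalized wire length l0 to its probability (Equation 4.3)."""
    dist = {}
    for z in range(n_s):            # vertical distance down the side
        for y in range(n_s):        # distance "into" the cube
            for dx in range(n_s):   # relative source/destination offset
                length = ceil_sqrt(dx * dx + y * y + z * z)
                prob = p_uniform(z, n_s) * p_uniform(y, n_s) * p_difference(dx, n_s)
                dist[length] = dist.get(length, 0.0) + prob
    return dist


def expected_length(dist):
    """Average normalized wire length for a distribution."""
    return sum(length * prob for length, prob in dist.items())


if __name__ == "__main__":
    for n_s in (4, 16):
        dist = hollow_cube_length_distribution(n_s)
        print(n_s, round(expected_length(dist), 2))
```

Working with integer squared lengths and `isqrt` keeps the ceiling exact, which matters because many (dx, y, z) combinations land exactly on an integer length.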


4.1.3 Distribution by Delay

Once we have the normalized distributions, it is easy to convert these into delay
distributions for a particular leaf stack size. This, of course, is the important performance
metric.

The previous distribution, $p_l$, can be converted into a delay distribution when the clock
cycle and stack size are given. With clock cycle $t_c$ and stack size $l_{stack}$, the number of delay cycles
between stages is related to the wire delay by Equation 4.6.

$$n_{cycles}(l_0) = \frac{t_{delay}(l_0)}{t_c} \tag{4.6}$$


Using an appropriate model for the propagation time of a signal across a given length of
wire, we can relate the number of delay cycles directly to the normalized wire length $l_0$.
Equation 4.7 gives this relation assuming the signal is free to propagate at the speed of
light, c, as would be the case for optical interconnect.


$$n_{cycles}(l_0) = \frac{l_0 \cdot l_{stack}}{c \cdot t_c} \tag{4.7}$$

With these relations, the wire length distribution can be converted into a distribution for
the number of delay cycles between stages, computed as in Equation 4.8.

$$p_n(n_0) = \sum_{l_0=0}^{\max(l)} \left[ p_l(l_0) \cdot u_n(n_0, l_0) \right] \tag{4.8}$$

Once again a selection function, $u_n(n_0, l_0)$, is used to select the appropriate lengths which
correspond to a given delay.

$$u_n(n_0, l_0) = \begin{cases} 1 & n_0 = \left\lceil n_{cycles}(l_0) \right\rceil \\ 0 & \text{otherwise} \end{cases} \tag{4.9}$$
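
The conversion from a length distribution to a delay-cycle distribution can be sketched as follows. This is a hypothetical illustration of Equations 4.6 through 4.9, assuming speed-of-light propagation (about 0.9836 feet per nanosecond), a stack side measured in feet, a clock period in nanoseconds, and a ceiling in the selection function. The default values and all names are assumptions made for the example.

```python
from math import ceil

C_FT_PER_NS = 0.9836  # speed of light in feet per nanosecond (approximate)


def delay_cycles(l0, l_stack_ft, t_c_ns, c=C_FT_PER_NS):
    """Unrounded number of clock cycles needed to cross l0 stack-side lengths."""
    return l0 * l_stack_ft / (c * t_c_ns)


def delay_distribution(length_dist, l_stack_ft=1.0, t_c_ns=10.0):
    """Fold a {normalized length: probability} map into {delay cycles: probability}.

    Every length is assigned to the smallest whole number of cycles that
    covers its propagation time, and probabilities landing on the same
    cycle count are added together.
    """
    out = {}
    for l0, prob in length_dist.items():
        n0 = ceil(delay_cycles(l0, l_stack_ft, t_c_ns))
        out[n0] = out.get(n0, 0.0) + prob
    return out


if __name__ == "__main__":
    # Toy length distribution, chosen only to exercise the conversion.
    toy = {0: 0.1, 5: 0.4, 12: 0.5}
    print(delay_distribution(toy))
```

Any length distribution expressed in units of the stack side can be passed in, so the same helper applies to both hollow cube and planar layouts.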


Current projections make $t_c \approx 10$ ns (Section 1.3). Recall that for $UT_{64 \times 8}$, $l_{stack} \approx 1'$
(Section 3.3.4). Equation 3.12 gives a rough approximation of leaf stack size for bidelta
clusters. Using these values, several distributions for a hybrid and full fat-tree hollow cube
structure are shown in the following figures. Figure 4.2 is the delay cycle distribution in
a hybrid fat-tree hollow cube using $B_{12}$ leaves. Figure 4.3 describes the distribution for
both hybrid fat-trees using $B_{48}$ leaves and full fat-trees using $UT_{64 \times 2}$ leaves. Figure 4.4
corresponds to the distribution for a hybrid fat-tree hollow cube with $B_{192}$ leaves. These
figures all parallel Figure 4.1, but the distributions here are given in terms of cycles of delay.
Table 4.2 gives the expected number of delay cycles for these configurations.


4.2 Large Bidelta Networks

For comparison, it is necessary to consider bidelta networks scaled to sizes comparable to
the fat-tree networks under consideration. However, since we cannot simply build arbitrarily
large stacks (Section 1.4), large bidelta networks must be decomposed into multiple stack
structures.
Figure 4.2: Distribution of Delay Cycles for $N_s$ = 4 and 16, respectively, for $B_{12}$ leaves

Figure 4.3: Distribution of Delay Cycles for $N_s$ = 4 and 16, respectively, for $B_{48}$ leaves

Figure 4.4: Distribution of Delay Cycles for $N_s$ = 4 and 16, respectively, for $B_{192}$ leaves

Table 4.2: Expected Number of Delay Cycles
4.2.1 Bidelta Stacks


Perhaps the ideal size for the constituent stacks is four routing stages. With four
stages of routing, the stack distinguishes 256 logical destinations. Thus each stack performs
a byte's worth of routing; a fresh routing byte can be used to route through each stack.

The final size of such a configuration will then depend on the number of outputs provided
in each of the logical directions. The smallest configuration would provide two outputs per
logical direction. In general, such a configuration would only be desirable as the final
routing stack composing a large bidelta network. In order to provide proper fault tolerance
with two outputs in each logical direction, the final stage of the network must utilize
the alternate RN1 configuration where each crossbar provides a single output per logical
direction (Section 1.5). Since this routing stage's performance is inferior to that of the
other stages, in order to achieve optimal routing performance only the final routing stack
should concentrate the number of outputs in each direction to two.

Being limited to at most 64 components in each routing stage (Section 3.3.1), we are not
free to consider bidelta stacks that both distinguish 256 different destinations and provide
more than two outputs in each logical direction. In general, if a stack distinguishes $n$ logical
destinations and has $l$ outputs in each logical direction, $n_c$ components will make
up each routing stage as given by Equation 4.10. This relation follows trivially from the
fact that each RN1 component has eight outputs.

$$n_c = \frac{n \cdot l}{8} \tag{4.10}$$

As such, we are only free to consider stacks in which $n \cdot l \le 512$. For clarity, bidelta stacks
with parameters $n$ and $l$ will be referenced as $B_{n \cdot l}$. From this discussion, we can conclude
that $B_{256 \cdot 2}$, $B_{64 \cdot 2}$, and $B_{16 \cdot 2}$ stacks should be used only as the final stage of stacks in
large bidelta networks. The largest stack reasonable for use in forming the earlier network
stages is the $B_{64 \cdot 8}$.

4.2.2 Arranging Bidelta Stacks

The bidelta stacks are then treated as the standard routing units. They are arranged
into stages in order to build larger bidelta networks. Within the bidelta stacks, the clock
cycle can be as fast as a single stack constrains it to be. Additional stage delays will be
incurred between stack stages, as is the case with fat-trees, because the wires will be long.
However, this extra delay is only incurred between stack stages and at the final recycle
path. The long wires can be dealt with as described in Section 3.6. This configuration will
give better performance than uniformly slowing the clock rate down everywhere in order
to accommodate the propagation delay across the long wire paths.

For replication, it will be necessary to place swallows between the stages of routing
stacks. In this manner, a separate byte is used to route through each bidelta stack.

Since the number of inter-stage delays does not affect the total clock rate of the system,
it makes sense to inter-wire the stacks in an indirect binary cube network style. This
allows the wires at the first few stages to be relatively short. Successively longer wires are
used between successive stages of routing stacks. This structure can be laid out in two
dimensions so that the critical geometric parameter at each stage is the square root of the
number of stacks involved in the convergence; that is, the subsets of the routing plane that
will be interconnecting between successive levels will be squares of stacks. At the final
stage, the convergence will be one big square; at earlier stages, there will be many smaller
squares of convergence. Figure 4.5 shows the interconnection topology of an indirect binary
cube using binary switching components. Using bidelta stacks for switching, each stack will
switch in 16, 64, or 256 directions, and the inter-stage wiring will occur in two dimensions
rather than one.

Figure 4.5: Indirect Binary Cube Topology


4.2.3 Wire Lengths

We can apply the analysis of Section 4.1 to analyze the distribution of wires by length
in this scheme as well. Here we have a sequence of routing planes with the interconnect
between planes. In this configuration, the wires are distributed with various displacements
in the x and y direction and an essentially constant displacement between planes. Assuming the
inter-plane displacement is negligible, we can get a feel for the distribution of wire lengths.

Here the distributions of displacements necessary in the x and y directions are, to a
first-order approximation, uniform difference distributions as described in Section 4.1.2.
From this, we can easily derive the distribution of wire lengths. $N_s$ is the length of the
side of the square of convergence at the stage of interest and is again normalized to stack
size. $p_l$ will again be our distribution function and $u_l$ will be the selection function as before.

Equations 4.11 through 4.13 summarize these relations.

$$p_l(l_0) = \sum_{dy=0}^{\max(l)} \sum_{dx=0}^{\max(l)} \left[ p_{dx}(dx) \cdot p_{dx}(dy) \cdot u_l(l_0, dx, dy) \right] \tag{4.11}$$

$$\max(l) = \left\lceil \sqrt{2} N_s \right\rceil \tag{4.12}$$

$$u_l(l_0, dx, dy) = \begin{cases} 1 & l_0 = \left\lceil \sqrt{dx^2 + dy^2} \right\rceil \\ 0 & \text{otherwise} \end{cases} \tag{4.13}$$
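
The planar distribution of Equations 4.11 through 4.13 can be checked numerically. The sketch below is an illustration under stated assumptions, not the original analysis code: it takes the difference distribution to be 1/N_s at zero and 2(N_s - d)/N_s^2 for 0 < d < N_s, and it reads the bracket in Equation 4.13 as a ceiling. The function names are hypothetical. Under those assumptions the N_s = 8 case evaluates to roughly 4.60 stack sides, in line with the 4.6 quoted in the text.

```python
from math import isqrt


def ceil_sqrt(n):
    """Smallest integer >= sqrt(n), computed exactly for integer n >= 0."""
    return 0 if n == 0 else isqrt(n - 1) + 1


def p_difference(d, n_s):
    """Distribution of |x1 - x2| for two uniform picks from 0..n_s-1."""
    if d == 0:
        return 1.0 / n_s
    if 0 < d < n_s:
        return 2.0 * (n_s - d) / n_s ** 2
    return 0.0


def planar_length_distribution(n_s):
    """Map each normalized wire length to its probability in a square of side n_s."""
    dist = {}
    for dx in range(n_s):
        for dy in range(n_s):
            length = ceil_sqrt(dx * dx + dy * dy)
            prob = p_difference(dx, n_s) * p_difference(dy, n_s)
            dist[length] = dist.get(length, 0.0) + prob
    return dist


def mean_length(dist):
    """Average normalized wire length for a distribution."""
    return sum(length * prob for length, prob in dist.items())


if __name__ == "__main__":
    for n_s in (8, 64):
        print(n_s, round(mean_length(planar_length_distribution(n_s)), 2))
```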
Figure 4.6: Distribution of Normalized Wire Lengths for $N_s$ = 8 and 64, Respectively.

Figure 4.7: Distribution of Delay Cycles for $N_s$ = 8 and 64, Respectively, using $B_{64 \cdot 8}$ and $B_{256 \cdot 2}$
Stacks.

Figure 4.6 shows the distributions for normalized side lengths of 8 and 64. If a large bidelta
network were built with three stack stages, two built from $B_{64 \cdot 8}$ and the last from $B_{256 \cdot 2}$,
$N_s$ would be 8 between the first and second stage and 64 between the second and third.
Normalized average wire lengths are 4.6 and 33.9 for $N_s$ = 8 and 64, respectively.


4.2.4 Delay Cycles

The number of delay cycles required can be determined exactly as described in
Section 4.1.3. Making the same assumptions as in Section 4.1.3, Figure 4.7 parallels Figure 4.6
in terms of delay cycle units. The side length for both $B_{64 \cdot 8}$ and $B_{256 \cdot 2}$ stacks is two
feet. For this configuration, the average nunber of del ay cycles for i s s=18 Smhen N 
6. 9 when N =64. 


4.3 Network Characterizations

This section summarizes a few quantifications for various network characteristics
parameterized by network size. Quantifications are provided for full fat-tree, hybrid fat-tree, and
bidelta networks. This allows some comparison across the range of networks between full
fat-tree and full bidelta networks. For each network, characterizations are provided for
required hardware and network latency.

Table 4.3 summarizes most of the variables used in the remainder of this section.
Symbol                     Meaning

$N_p$                      number of processors
$N_c$                      number of routing components
$N_{leaf}$                 number of processors supported by a leaf stack
$N_{c_{leaf}}$             number of routing components in a bidelta leaf stack
$N_{bidelta}$              number of bidelta leaf clusters
$N_{ut_{64 \times 8}}$     number of $UT_{64 \times 8}$ unit trees
$N_{ut_{64 \times 2}}$     number of $UT_{64 \times 2}$ unit trees
$i$                        a parameter describing selection freedom;
                           i represents the number of stack stages in a network;
                           i must be a positive integer
$j$                        a parameter describing selection freedom;
                           j represents the number of routing stages in a bidelta stack
$h$                        total number of routing stages in a bidelta network
$l_{chip}$                 length of a side of the routing chip
$t_c$                      clock period
                           length of longest wire
$t_{delay}$                wire delay
$c$                        speed of light
$n_{(m,o)}$                stage delays between stage m and o
$L_{open}$                 latency opening connection from source to destination
$L_{connect}$              latency of an open connection from source to destination

N.B. For the following, latency is used to refer to the period of time between the time
the message enters the network and the time the message arrives at the destination. The
actual latency of a network operation will be a function of this metric and will depend
on the end-to-end protocol used. Connection latency, $L_{connect}$, differs from the latency
of opening a connection, $L_{open}$, due to the need to change routing bytes during the opening of
a connection.

Table 4.3: Variable Summary


4.3.1 Full Fat-Tree

Recall from Section 3.3.4 that full fat-tree networks are built entirely from unit trees.
Full fat-trees have processors as the ultimate leaves of the fat-tree structure.

Size

Full fat-tree networks are built with $UT_{64 \times 2}$ unit trees forming the bottommost stage
of the fat-tree and $UT_{64 \times 8}$ unit trees forming the remainder of the tree structure.
Equation 4.14 characterizes the sizes of constructible full fat-trees.

$$N_p = 64^i \tag{4.14}$$
A full fat-tree network of a given size will have 3i tree levels made from i stages of unit
trees. The fat-tree will be composed of i up routing stages and 3i down routing stages.

Hardware Requirements

Each $UT_{64 \times 8}$ unit tree requires 208 RN1 routing components (Section 3.3.1) while each
$UT_{64 \times 2}$ requires 52 RN1 components (Section 3.3.4). Equations 4.15 and 4.16 respectively
describe the required number of $UT_{64 \times 2}$ and $UT_{64 \times 8}$ unit trees necessary to build full
fat-trees of size $N_p$ (Section 3.3.4).

$$N_{ut_{64 \times 2}} = \frac{N_p}{64} \tag{4.15}$$

$$N_{ut_{64 \times 8}} = \left( 1 - 4^{(1-i)} \right) \frac{N_p}{768} \tag{4.16}$$

Combining these requirements, we find the total number of routing components required
as expressed in Equation 4.17.

$$N_c = (52) \left( \frac{N_p}{64} \right) + (208) \left( 1 - 4^{(1-i)} \right) \left( \frac{N_p}{768} \right) = \frac{13}{12} \left( 1 - 4^{-i} \right) N_p \tag{4.17}$$
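
As a cross-check on the hardware relations for full fat-trees, the following Python sketch evaluates the unit tree and component counts for a given number of unit tree stages i. It is an illustration only: it assumes N_p = 64^i, N_p/64 bottom-stage unit trees of 52 components each, and (1 - 4^(1-i)) N_p / 768 upper-stage unit trees of 208 components each. The function and key names are hypothetical. For i = 1, 2, 3 it yields 52, 4,160, and 279,552 components, the totals listed for these sizes in Table 4.4.

```python
def full_fat_tree_hardware(i):
    """Return processor, unit tree, and routing component counts for i stages."""
    if i < 1:
        raise ValueError("i must be a positive integer")
    n_p = 64 ** i
    n_ut_bottom = n_p // 64                 # 52-component unit trees at the leaves
    # (1 - 4^(1-i)) * N_p / 768, kept in integers: N_p * 4^(1-i) = 4 * 16^i
    n_ut_upper = (n_p - 4 * 16 ** i) // 768  # 208-component unit trees above them
    n_c = 52 * n_ut_bottom + 208 * n_ut_upper
    return {
        "processors": n_p,
        "bottom_unit_trees": n_ut_bottom,
        "upper_unit_trees": n_ut_upper,
        "components": n_c,
    }


if __name__ == "__main__":
    for i in (1, 2, 3):
        print(full_fat_tree_hardware(i))
```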


Latency

As described in Section 3.6 there will be some number of stage delays between stack
stages. Let $n_{(m,o)}$ be the number of clock cycles of delay between stack stage m and stage
o. Between any pair of stages, the value $n_{(j,j+1)}$ will be distributed as
described in Section 4.1. Since there is a distribution of possible delays between
the same stack stages, no single value describes this quantity. The reverse stage delays,
$n_{(j+1,j)}$, will be distributed similarly; however, a particular $n_{(j+1,j)}$ need not be related to
the corresponding $n_{(j,j+1)}$ utilized as part of the same interconnection.

To route through a given level of the full fat-tree, a connection must route through one
up routing level for each 3 levels up the tree. The connection will then route through all
the down levels. Between stack stages on both the up and down path, the route will suffer
the additional stage delays due to the long wires between stages. The latency through level
k of the tree is given by Equation 4.18.

$$L_{connect}(k) = \left( \left\lceil \frac{k}{3} \right\rceil + k + \sum_{j=1}^{\lceil k/3 \rceil - 1} n_{(j,j+1)} + \sum_{j=1}^{\lceil k/3 \rceil - 1} n_{(j+1,j)} \right) \times t_c \tag{4.18}$$


In opening a connection through level k, an additional cycle of delay is incurred between
stacks on the down route in order to swallow the leading routing byte. Similarly, at least
one up routing byte will need to be dropped in crossing over to the down routing path;
every fourth stage between unit trees requires an additional stage of delay for the leading
routing byte to be dropped. The latency to open a connection through level k of the fat-tree
is summarized in Equation 4.19.

$$L_{open}(k) = L_{connect}(k) + \left( \left\lceil \frac{k}{3} \right\rceil - 1 + \left\lceil \frac{k}{12} \right\rceil \right) \times t_c = \left( \left\lceil \frac{k}{12} \right\rceil + 2 \left\lceil \frac{k}{3} \right\rceil + k - 1 + \sum_{j=1}^{\lceil k/3 \rceil - 1} n_{(j,j+1)} + \sum_{j=1}^{\lceil k/3 \rceil - 1} n_{(j+1,j)} \right) \times t_c \tag{4.19}$$


The best-case routing occurs when the most locality can be exploited. This occurs when
a processor connects to its nearest neighbor. In this case, the route occurs through level
one of the fat-tree. Thus Equations 4.20 and 4.21 describe the best-case performance for a
full fat-tree.

$$L_{connect} = 2 t_c \tag{4.20}$$

$$L_{open} = 3 t_c \tag{4.21}$$

In the worst case, the only common ancestor of the source and destination processors
is the tree root. In this case, the connection must travel all i stages up to the root of the
fat-tree and then all 3i stages back down to the destination leaf. Equations 4.22 and 4.23
give the worst-case latencies for the full fat-tree.

$$L_{connect} = \left( 4i + \sum_{j=1}^{i-1} n_{(j,j+1)} + \sum_{j=1}^{i-1} n_{(j+1,j)} \right) \times t_c \tag{4.22}$$

$$L_{open} = \left( \left\lceil \frac{i}{4} \right\rceil + 5i - 1 + \sum_{j=1}^{i-1} n_{(j,j+1)} + \sum_{j=1}^{i-1} n_{(j+1,j)} \right) \times t_c \tag{4.23}$$
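
The full fat-tree latency relations can be exercised numerically. The sketch below is a hedged illustration, not the method used to produce the latency tables. It assumes a 10 ns clock, one-foot stack sides, speed-of-light propagation, a hollow cube side of 4^m stacks at the m-th convergence, and worst-case (cube diagonal) wires in both directions. It also assumes that a connection through level k costs ceil(k/3) + k cycles plus wire delays, and that opening adds ceil(k/3) - 1 + ceil(k/12) cycles. All names are hypothetical. Under these assumptions the worst cases for 64, 4,096, and 262,144 processors come out as 40/100/200 ns to connect and 50/120/230 ns to open, with best cases of 20 ns and 30 ns.

```python
from math import ceil, sqrt

T_C_NS = 10.0          # assumed clock period
L_STACK_FT = 1.0       # assumed unit tree stack side length
C_FT_PER_NS = 0.9836   # speed of light, feet per nanosecond (approximate)


def worst_stage_delay(m):
    """Worst-case delay cycles between unit tree stage m and stage m + 1."""
    n_s = 4 ** m                               # cube side in stack lengths
    longest_ft = sqrt(3) * n_s * L_STACK_FT    # cube diagonal
    return ceil(longest_ft / (C_FT_PER_NS * T_C_NS))


def connect_cycles(k, delay=worst_stage_delay):
    """Cycles for an open connection through tree level k."""
    stages = (k + 2) // 3                      # ceil(k / 3) unit tree stages
    wires = sum(delay(m) for m in range(1, stages))
    return stages + k + 2 * wires              # same delay assumed up and down


def open_cycles(k, delay=worst_stage_delay):
    """Cycles to open a connection through tree level k."""
    stages = (k + 2) // 3
    swallows = (stages - 1) + (k + 11) // 12   # down path plus crossover/up path
    return connect_cycles(k, delay) + swallows


if __name__ == "__main__":
    for i in (1, 2, 3):
        k = 3 * i                              # worst case: through the root
        print(64 ** i, connect_cycles(k) * T_C_NS, open_cycles(k) * T_C_NS)
```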


4.3.2 Hybrid Fat-Tree

Hybrid fat-trees are built with bidelta leaf clusters forming the leaves of the fat-tree
structure.

Size

The leaf stage of the network is built with the bidelta leaf clusters described in
Section 3.1. These are interconnected with a fat-tree structure constructed from $UT_{64 \times 8}$ unit
trees. Each leaf cluster is built from j routing stages and supports $N_{leaf}$ processors as
described by Equation 4.24.

$$N_{leaf} = 3 \cdot 4^{(j-1)} \tag{4.24}$$

For technical reasons described in Section 3.1.3, $1 \le j \le 4$. Using $i - 1$ stages of $UT_{64 \times 8}$
unit trees, the total number of processors supported is thus given by Equation 4.25.

$$N_p = N_{leaf} \cdot 64^{(i-1)} = 3 \cdot 4^{(j-1)} \cdot 64^{(i-1)} \tag{4.25}$$

The fat-tree structure provides $i - 1$ up routing stages and $3(i - 1)$ tree levels and hence
down routing stages.
Hardware Requirements

Each leaf is constructed from $N_{c_{leaf}}$ RN1 routing components as described by
Equation 4.26.

$$N_{c_{leaf}} = 4^{(j-1)} + (j - 1) \cdot 3 \cdot 4^{(j-2)} = (3j + 1) \, 4^{(j-2)} \tag{4.26}$$

The total number of bidelta leaf clusters required is described by Equation 4.27
(Section 3.3.3).

$$N_{bidelta} = \frac{N_p}{N_{leaf}} = 64^{(i-1)} \tag{4.27}$$

The number of unit trees required for a hybrid fat-tree with $i - 1$ unit tree stages is given
by Equation 4.28 (Section 3.3.3).

$$N_{ut_{64 \times 8}} = \left( \frac{N_p}{576} \right) \left( 1 - 4^{(1-i)} \right) = \left( \frac{3 \cdot 4^{(j-1)} \cdot 64^{(i-1)}}{576} \right) \left( 1 - 4^{(1-i)} \right) \tag{4.28}$$

Each $UT_{64 \times 8}$ unit tree is constructed from 208 RN1 components (Section 3.3.1).
Consolidating these relations, we get Equation 4.29 which describes the total number of routing
components in a hybrid fat-tree.

$$N_c = (3j + 1) \, 4^{(j-2)} \cdot 64^{(i-1)} + 208 \left( \frac{N_p}{576} \right) \left( 1 - 4^{(1-i)} \right) = \left[ (3j + 1) \, 4^{(j-2)} + \frac{13}{3} \left( 4^{(j-2)} - 4^{(j-i-1)} \right) \right] 64^{(i-1)} \tag{4.29}$$
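
The hybrid fat-tree hardware relations can be tabulated with a short sketch. The Python below is illustrative and rests on these assumptions: a leaf cluster of j stages supports 3 * 4^(j-1) processors and uses (3j + 1) * 4^(j-2) components, there are 64^(i-1) leaf clusters, and the number of 208-component unit trees is (N_p / 576)(1 - 4^(1-i)). The names are hypothetical. For the six (i, j) combinations listed in Table 4.6 it returns the component totals given there.

```python
def hybrid_fat_tree_hardware(i, j):
    """Return (processors, leaf clusters, unit trees, components) for a hybrid fat-tree.

    i counts stack stages (one leaf stage plus i - 1 unit tree stages);
    j counts routing stages inside each bidelta leaf cluster.
    """
    if i < 1 or not 1 <= j <= 4:
        raise ValueError("expected i >= 1 and 1 <= j <= 4")
    n_leaf = 3 * 4 ** (j - 1)                    # processors per leaf cluster
    n_bidelta = 64 ** (i - 1)                    # number of leaf clusters
    n_p = n_leaf * n_bidelta
    n_c_leaf = (3 * j + 1) * 4 ** (j - 1) // 4   # (3j + 1) * 4^(j - 2), exact
    # (N_p / 576) * (1 - 4^(1 - i)) in integer arithmetic
    n_ut = (n_p - n_p // 4 ** (i - 1)) // 576
    n_c = n_c_leaf * n_bidelta + 208 * n_ut
    return n_p, n_bidelta, n_ut, n_c


if __name__ == "__main__":
    for i in (2, 3):
        for j in (2, 3, 4):
            print(hybrid_fat_tree_hardware(i, j))
```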


Latency

The best-case interconnect in the hybrid fat-tree occurs when a connection is made
within the bidelta leaf cluster. When this level of locality can be exploited, the latency for
a connection is given by Equation 4.30.

$$L_{connect} = L_{open} = j \times t_c \tag{4.30}$$

Opening and continuing a connection are identical in this case because j is constrained to
not exceed four due to current technology limitations.

When communications do need to occur between processors in different leaf clusters,
connections must be made through the fat-tree. As connections are routed between stack
stages, delay cycles will be incurred as described for the full fat-tree (Section 4.3.1). In
routing through some stage of the fat-tree, the connection first requires one cycle to route
into the fat-tree network. The connection traverses one up routing stage every three stages
up the tree it must go. Once the connection reaches the root of the smallest common
sub-tree between the source and destination processor, the connection must then be routed
down all the down routing stages to the desired bidelta leaf stack. Once the connection
finally enters the destination leaf stack, there will be an additional j stages of routing within
the leaf stack. Equation 4.31 summarizes the latency of a connection through level k of the
fat-tree network.

$$L_{connect}(k) = \left( 1 + \left\lceil \frac{k}{3} \right\rceil + j + k + \sum_{m=1}^{\lceil k/3 \rceil} n_{(m,m+1)} + \sum_{m=1}^{\lceil k/3 \rceil} n_{(m+1,m)} \right) \times t_c \tag{4.31}$$


When opening a connection, additional delay cycles are required due to the need to change
routing bytes. The hybrid fat-tree requires the additional cycles for swallowing bytes on
the up routing path, down routing path, and in the crossovers as described for full
fat-trees (Section 4.3.1). Since two bits of routing are required to route into the fat-tree, the
swallow on the up path will occur one stack stage earlier in this configuration than in the
full fat-tree configuration. The hybrid fat-tree will require an additional change of routing
bytes when the connection enters the destination leaf cluster. Equation 4.32 combines these
additional delays to describe the latency for opening a connection through level k of the
fat-tree portion of a hybrid fat-tree.
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(4.32) 


Worst-case connections in the hybrid fat-tree occur when interconnection must be made
through the top level of the top stage of unit trees. Here the connection is made into the
fat-tree, up the $i - 1$ up routing stages to the root, down the $3(i - 1)$ stages to a leaf cluster,
then across the j stages of the leaf cluster to the destination processor. Equations 4.33 and
4.34 describe the latency for this worst-case interconnection.


$$L_{connect} = \left( 1 + j + 4(i - 1) + \sum_{m=1}^{i-1} n_{(m,m+1)} + \sum_{m=1}^{i-1} n_{(m+1,m)} \right) \times t_c \tag{4.33}$$

$$L_{open} = \left( \left\lceil \frac{i}{4} \right\rceil + 1 + j + 5(i - 1) + \sum_{m=1}^{i-1} n_{(m,m+1)} + \sum_{m=1}^{i-1} n_{(m+1,m)} \right) \times t_c \tag{4.34}$$


4.3.3 Bidelta Network

Section 4.2 described the construction of large bidelta networks. This is a generalization
of the Transit bidelta network to multiple stack sizes.

Size

The constituent stack stages are each constructed with j layers of routing. i stack stages
can then be combined to construct the large bidelta network. The j's for each stage in a
single network need not be identical across all stack stages. Thus the number of processors
in such a network is described by Equation 4.35.

$$N_p = 4^{j_1} \cdot 4^{j_2} \cdots 4^{j_i} \tag{4.35}$$

Recall from Section 4.2.1 that size and performance constraints limit j as given by
Relations 4.36 and 4.37.

$$\text{for } m < i: \quad 1 \le j_m \le 3 \tag{4.36}$$

$$2 \le j_i \le 4 \tag{4.37}$$

The total number of routing stages in this network is summarized in Equation 4.38.

$$h = j_1 + j_2 + \cdots + j_i = \log_4(N_p) \tag{4.38}$$

Hardware Requirements

The number of routing components needed can be determined by looking at the entire
network without a need to do individual accounting at each stack stage. Considering the
bandwidth in and out of the beginning and end of the network, Equation 4.39 gives the
number of routing components needed at each routing stage.

(4.39) 

Combining this with Equation 4.38, the total number of RN1 routing components required
for the bidelta network will be determined by Equation 4.40.

$$N_c = h \times \frac{N_p}{4} = \frac{N_p \log_4(N_p)}{4} \tag{4.40}$$
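
A small sketch of the bidelta sizing relations follows. It is illustrative only and assumes that a network whose stack stages have j_1 ... j_i routing stages supports 4^h processors with h = j_1 + ... + j_i, uses N_p / 4 routing components per routing stage, and obeys the stage limits 1 <= j_m <= 3 for every stage but the last and 2 <= j_i <= 4 for the last. The function name and the example configuration are hypothetical.

```python
def bidelta_hardware(stack_stages):
    """Return (processors, routing stages, components) for a large bidelta network.

    stack_stages lists the number of routing stages in each stack stage,
    first stage to last.
    """
    if not stack_stages:
        raise ValueError("at least one stack stage is required")
    *earlier, last = stack_stages
    if any(not 1 <= j <= 3 for j in earlier):
        raise ValueError("every stage before the last needs 1 <= j <= 3")
    if not 2 <= last <= 4:
        raise ValueError("the last stage needs 2 <= j <= 4")
    h = sum(stack_stages)          # total routing stages
    n_p = 4 ** h                   # processors supported
    n_c = h * (n_p // 4)           # N_p / 4 components per routing stage
    return n_p, h, n_c


if __name__ == "__main__":
    # Two three-stage stack stages followed by a four-stage final stack stage.
    print(bidelta_hardware([3, 3, 4]))
```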


4.3.4 Latency

Delay stages are incurred between successive stack stages as previously described.
Section 4.2.3 describes the distribution characteristics of the extra delay cycles required between
stack stages. Since the set of source processors is the same as the destination set,
connections must loop back from the end of the bidelta network to the beginning. Since this
requires only translation in the vertical stack dimension, we will assume this loop-back can
occur in a single clock cycle, $t_{stack}$. Note that since routing only occurs in one direction
through the bidelta network, unlike the tree structures, the delay between stack stages is
only incurred once in this structure.

With the exception of the number of delay cycles incurred between stages due to long
wires, all paths through the bidelta network are the same length. Each path traverses
the h routing stages described above. Inter-stage delay is incurred as well as a cycle for
the final loop-back. As such, the connection latency for a bidelta network is described by
Equation 4.41.

$$L_{connect} = \left( h + \sum_{k=1}^{i-1} n_{(k,k+1)} + 1 \right) \times t_{stack} \tag{4.41}$$

In opening a new connection through the bidelta network, additional latency occurs as a
result of changing routing bytes. Routing bytes will be changed between routing stack
stages as described in Section 4.2.2. Due to the constraints placed on j, there will never
be a need to change routing bytes within a bidelta stack. Equation 4.42 gives the latency
required to open a connection through the bidelta network.

$$L_{open} = L_{connect} + (i - 1) \times t_{stack} = \left( h + i + \sum_{k=1}^{i-1} n_{(k,k+1)} \right) \times t_{stack} \tag{4.42}$$

Number of      Unit Tree                                              Total
Processors     Stacks                                                 Components

64             1 $UT_{64 \times 2}$                                   52
4,096          64 $UT_{64 \times 2}$, 4 $UT_{64 \times 8}$            4,160
262,144        4096 $UT_{64 \times 2}$, 320 $UT_{64 \times 8}$        279,552

Table 4.4: Full Fat-Tree Hardware Requirements
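
The bidelta latency relations can be sketched in the same way. This Python is a hedged illustration assuming that connection latency is (h + sum of inter-stage delay cycles + 1 loop-back cycle) times the stack clock period, and that opening a connection adds one cycle per boundary between stack stages. The 10 ns default clock period and the inter-stage delay values in the example are hypothetical inputs, not figures from this report.

```python
def bidelta_latency_ns(stack_stages, inter_stage_delays, t_stack_ns=10.0):
    """Return (connect latency, open latency) in nanoseconds.

    stack_stages lists the routing stages in each stack stage;
    inter_stage_delays lists the delay cycles between consecutive stack
    stages, so it must have one entry fewer than stack_stages.
    """
    i = len(stack_stages)
    if len(inter_stage_delays) != i - 1:
        raise ValueError("need exactly one delay per stack stage boundary")
    h = sum(stack_stages)                              # total routing stages
    connect_cycles = h + sum(inter_stage_delays) + 1   # +1 for the loop-back
    open_cycles = connect_cycles + (i - 1)             # routing byte changes
    return connect_cycles * t_stack_ns, open_cycles * t_stack_ns


if __name__ == "__main__":
    # Three stack stages with hypothetical delays of 2 and 7 cycles between them.
    print(bidelta_latency_ns([3, 3, 4], [2, 7]))
```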


4.4 Comparisons

In order to offer a clearer comparison between the various networks described herein,
this section provides some concrete examples. Using the equations and assumptions from
Section 4.3 and the geometric configurations described in Sections 3.4 and 4.2,
representative numbers are determined for network size, composition, and latency. All values are
characterized as described in Section 4.3.


4.4.1 Full Fat-Tree

Table 4.4 summarizes the hardware requirements for a few full fat-trees of interesting
sizes. Table 4.5 characterizes the extreme latency cases for the full fat-trees described in
Table 4.4. The worst-case latency included here uses the worst-case wire delays between
unit tree stages on both paths through the fat-tree and assumes the connection must be
made through the tree root. Review Section 4.1 to get a feel for the distribution of latencies
between the best and worst cases.


4.4.2 Hybrid Fat-Tree

Table 4.6 describes several sizes of hybrid fat-trees. Table 4.7 summarizes the latency
range for these configurations. Just as for the full fat-trees, the worst-case latency shown is
for a connection through the tree root with the longest possible wires between stack stages
on both the up and down tree traversal. Again, Section 4.1 describes the distributions of
wire delays for configurations such as these.


Number of      Latency of Connect          Latency of Open
Processors     Worst Case    Best Case     Worst Case    Best Case

64             40 ns         20 ns         50 ns         30 ns
4,096          100 ns        20 ns         120 ns        30 ns
262,144        200 ns        20 ns         230 ns        30 ns

Table 4.5: Full Fat-Tree Latencies


Number of      Bidelta          Unit Tree           Total
Processors     Stacks           Stacks              Components

768            64 B_{12}        1 UT_{64…}          656
3,072          64 B_{48}        4 UT_{64…}          3,392
12,288         64 B_{192}       16 UT_{64…}         16,640
49,152         4096 B_{12}      80 UT_{64…}         45,312
196,608        4096 B_{48}      320 UT_{64…}        230,400
786,432        4096 B_{192}     1280 UT_{64…}       1,118,208

Table 4.6: Hybrid Fat-Tree Hardware Requirements


Number of      Latency of Connect          Latency of Open
Processors     Worst Case    Best Case     Worst Case    Best Case

768            90 ns         20 ns         110 ns        20 ns
3,072          100 ns        30 ns         120 ns        30 ns
12,288         130 ns        40 ns         150 ns        40 ns
49,152         170 ns        20 ns         200 ns        20 ns
196,608        200 ns        30 ns         230 ns        30 ns
786,432        290 ns        40 ns         320 ns        40 ns

Table 4.7: Hybrid Fat-Tree Latencies


4.4.3 Full Bidelta

Tables 4.8 and 4.9 summarize the characteristics for some bidelta networks of comparable
size to the full fat-tree and hybrid fat-tree configurations listed above. For this case,
the worst-case latencies are for the case in which the wires are maximal length between
stages. The best-case latencies assume a single cycle of delay due to inter-stage wiring.
Section 4.2.3 describes the delay cycle distributions for this configuration.


Number of      Bidelta                             Total
Processors     Stacks                              Components

64             B_{64…}                             48
256            B_{256…}                            256
1,024          B_{64…} B_{16…}                     1,280
4,096          B_{64…} B_{64…}                     6,144
16,384         B_{64…} B_{256…}                    28,672
65,536         B_{64…} B_{64…} B_{16…}             131,072
262,144        B_{64…} B_{64…} B_{64…}             589,824
1,048,576      B_{64…} B_{64…} B_{256…}            2,621,440

Table 4.8: Bidelta Network Hardware Requirements


Number of      Latency of Connect          Latency of Open
Processors     Worst Case    Best Case     Worst Case    Best Case

64             30 ns         30 ns         30 ns         30 ns
256            40 ns         40 ns         40 ns         40 ns
1,024          70 ns         70 ns         80 ns         80 ns
4,096          90 ns         80 ns         100 ns        90 ns
16,384         110 ns        90 ns         120 ns        100 ns
65,536         160 ns        110 ns        180 ns        130 ns
262,144        210 ns        120 ns        230 ns        140 ns
1,048,576      310 ns        130 ns        330 ns        150 ns

Table 4.9: Bidelta Network Latencies
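
As a cross-check on Tables 4.8 and 4.9, the bidelta size and latency relations can be sketched in a few lines of Python. This is an illustrative sketch rather than part of the original analysis. It assumes a 10 ns routing clock (inferred only from the fact that every tabulated latency is a multiple of 10 ns), radix-4 stages so that n stages serve 4^n processors, a loop-back cycle only when more than one stack is traversed (chosen because it matches the single-stack rows of Table 4.9), and one routing-byte change per stack boundary. All function and parameter names are hypothetical.

```python
T_CLOCK_NS = 10  # assumed clock period; inferred from the tables, not stated in the text


def bidelta_size(stages: int) -> tuple[int, int]:
    """Return (processors, routing components) for an n-stage radix-4 bidelta."""
    processors = 4 ** stages
    components = stages * 4 ** (stages - 1)  # pattern observed in Table 4.8
    return processors, components


def bidelta_latency(stages: int, stacks: int, delay_cycles: int) -> tuple[int, int]:
    """Return (connect_ns, open_ns) for one path through the network.

    delay_cycles is the total number of wire-delay cycles summed over all
    stack-to-stack boundaries on the path.
    """
    loop_back = 1 if stacks > 1 else 0  # assumption chosen to match Table 4.9
    connect_cycles = stages + delay_cycles + loop_back
    open_cycles = connect_cycles + (stacks - 1)  # one routing-byte change per boundary
    return connect_cycles * T_CLOCK_NS, open_cycles * T_CLOCK_NS


if __name__ == "__main__":
    print(bidelta_size(6))           # (4096, 6144)
    print(bidelta_latency(6, 2, 1))  # best case for 4,096 processors: (80, 90)
    print(bidelta_latency(6, 2, 2))  # worst case for 4,096 processors: (90, 100)
```

Under these assumptions the sketch reproduces the 4,096-processor row of Table 4.9. The worst-case delay cycle counts for larger networks depend on the wire lengths discussed in Section 4.2.3 and must be supplied by the caller.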


4.5 Routing Statistics

Using the simple probabilistic models for routing analysis developed in [Knight 90],
this section provides some basic routing statistics for the network topologies described
here. Statistics are provided for all of the configurations detailed in the previous section.
All statistics summarized here are based on the network model presented in Section 1.2.

This means that each processor will use at most one of its two inputs into the network
at a given point in time. The probabilistic models used to derive these statistics simply
characterize routing probability in terms of network loading. The effect of input queuing
at each source is not modeled.

Traffic distribution will have a considerable effect on the actual performance of any
of these networks. The fat-tree structured networks require considerable communication
locality in order to perform optimally. Each network is considered with the loading
distributions for which it is most favorable. Along with the routing statistics, this section provides
characterizations for the locality required by each network configuration to achieve near
optimal performance.


4.5.1 Full Fat-Tree

A full fat-tree network is optimized to take advantage of locality. The actual construction
and bandwidth allocation determine the locality characteristics necessary to utilize the
network most effectively. The optimal loading case occurs when each connection through an
RN1 routing component is equally likely to be made through each of the component's outputs.
Within a unit tree structure, this means that a connection is equally likely to cross over
at each lateral crossover stage. In a unit tree stage other than the root, the connection further
up the tree will also be utilized with equal likelihood as each lateral crossover. Taking
these two considerations together, the distribution of connections through each tree level
in the same unit tree is roughly equal; the distribution of connections through unit tree
stages drops off by a factor of four at each successive stage toward the root.

There are two ways of looking at the locality distribution. As described, we can think
of the distribution in terms of the probability of routing through a given height in the
fat-tree. This is the probability, P_any(n), that a processor will need to communicate with
any processor a fixed distance, n, away from the source processor. The probability that the
processor will need to communicate with a particular processor a fixed distance from the
source processor is an alternate way of viewing this locality. This probability, P_particular(n),
is simply the previously mentioned probability normalized by the number of processors that
are located the fixed distance away from the source processor. Tables 4.10 through 4.12
summarize the locality distributions assumed for the full fat-trees considered here in terms
of these locality measures.

With these locality models, Figures 4.8 through 4.10 show the routing statistics on each
of these full fat-trees. Each graph includes a separate curve for the routing statistic through
each level of the fat-tree. The topmost curve characterizes the statistics for a connection
through the first level of the fat-tree. Successive curves down the graph give routing statistics
for routing through successive tree levels; the bottommost curve characterizes routing
through the tree root.

Figure 4.11 shows the normalized routing probabilities for these three sizes of full
fat-trees. The normalized statistics are roughly identical for the three sizes, so the individual
normalized curves are indistinguishable. The normalized probabilities give an idea of overall
network performance under the assumed loading conditions. The normalized probabilities
are determined by weighting the probability of routing through a given tree level by the
probability that a connection can successfully be made through that tree level as summarized
in Equation 4.43.


$$ P_{norm} = \sum_{n=1}^{\max(n)} P_{any}(n) \cdot P_{route}(n) \qquad (4.43) $$


n      P_any(n)         P_particular(n)

1      0.333            0.111
2      0.333            2.78 × 10^-2
3      0.333            6.94 × 10^-3

Table 4.10: Locality Structure of Full Fat-Trees (64 Processors)


n      P_any(n)         P_particular(n)

1      0.278            9.27 × 10^-2
2      0.278            2.32 × 10^-2
3      0.278            5.79 × 10^-3
4      5.53 × 10^-2     2.88 × 10^-4
5      5.53 × 10^-2     7.20 × 10^-5
6      5.53 × 10^-2     1.80 × 10^-5

Table 4.11: Locality Structure of Full Fat-Trees (4096 Processors)


n      P_any(n)         P_particular(n)

1      0.274            9.13 × 10^-2
2      0.274            2.28 × 10^-2
3      0.274            5.71 × 10^-3
4      3.17 × 10^-2     1.65 × 10^-4
5      3.17 × 10^-2     4.13 × 10^-5
6      3.17 × 10^-2     1.03 × 10^-5
7      1.06 × 10^-2     8.62 × 10^-7
8      1.06 × 10^-2     2.16 × 10^-7
9      1.06 × 10^-2     5.39 × 10^-8

Table 4.12: Locality Structure of Full Fat-Trees (262144 Processors)
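
The relationship between the two locality measures, and the weighting of Equation 4.43, can be illustrated with a short Python sketch. It is not taken from the original report. It assumes that the number of processors first reached at tree level n of a radix-4 fat-tree is 4^n - 4^(n-1), which is consistent with the ratios in Tables 4.10 through 4.12, and the per-level route success probabilities passed to `p_norm` are hypothetical placeholders, since the real values come from the routing statistics in the figures. All function names are illustrative.

```python
def processors_at_distance(n: int) -> int:
    """Processors first reachable at tree level n (assumed radix-4 fat-tree)."""
    return 4 ** n - 4 ** (n - 1)  # 3, 12, 48, 192, ...


def p_particular(p_any: float, n: int) -> float:
    """Normalize P_any(n) by the number of processors at distance n."""
    return p_any / processors_at_distance(n)


def p_norm(p_any_by_level, p_route_by_level) -> float:
    """Weighted sum of Equation 4.43: sum over levels of P_any(n) * P_route(n)."""
    return sum(a * r for a, r in zip(p_any_by_level, p_route_by_level))


if __name__ == "__main__":
    p_any = [1 / 3, 1 / 3, 1 / 3]  # 64-processor case of Table 4.10
    for level, p in enumerate(p_any, start=1):
        print(level, f"{p_particular(p, level):.3g}")
    # Hypothetical route success probabilities, for illustration only.
    print(p_norm(p_any, [0.9, 0.8, 0.7]))
```

For the 64-processor case the computed P_particular values round to the entries of Table 4.10.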


4.5.2 Hybrid Fat-Tree

The hybrid fat-tree also takes advantage of locality much like the full fat-tree. Since
only one-quarter of the bandwidth out of the first stage connects to the fat-tree structure,
optimal performance occurs when three-quarters of the processor-initiated traffic is local to
the bidelta leaf clusters. Within the leaf, the distribution of traffic is uniform; that is, it is
equally likely to need to connect to any of the processors within the leaf. For connections
through the tree, the arrangement is identical to that of a full fat-tree so the distribution
of requests by tree level will progress in the same manner. Tables 4.13 and 4.14 summarize
the distributions assumed for the hybrid fat-tree network traffic.
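
The hybrid locality assumptions described above can be sketched as follows. This is a rough illustration under stated assumptions, not the author's derivation: three-quarters of the traffic stays in the leaf and is spread uniformly over the other leaf processors, the remaining quarter is split equally across the levels of a single unit tree stage, and tree level n reaches 3 * 4^(n-1) leaf clusters. The function name and parameters are hypothetical.

```python
def hybrid_locality(leaf_size: int, tree_levels: int, local_fraction: float = 0.75):
    """Return (p_leaf_particular, [(p_any, p_particular), ...] per tree level)."""
    # Uniform traffic to each of the other processors in the same leaf cluster.
    p_leaf_particular = local_fraction / (leaf_size - 1)
    # Assumed equal split of the non-local traffic across the tree levels.
    p_any = (1.0 - local_fraction) / tree_levels
    levels = []
    for n in range(1, tree_levels + 1):
        processors = 3 * 4 ** (n - 1) * leaf_size  # processors first reached at level n
        levels.append((p_any, p_any / processors))
    return p_leaf_particular, levels


if __name__ == "__main__":
    _, levels = hybrid_locality(leaf_size=192, tree_levels=3)
    for n, (p_any, p_part) in enumerate(levels, start=1):
        print(n, f"{p_any:.3g}", f"{p_part:.3g}")
```

With a 192-processor leaf the first tree level comes out near 1.45 × 10^-4 per destination, which appears to agree with the corresponding surviving entry of Table 4.13.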
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1 
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1.45 X 10 4 
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8.33 X 10 3 

5.78 X 10 4 

1.45 X 10 4 

3.62 x 10 s 

3 

8.33 X 10 3 

1.45 X 10 4 

3.62 x 10® 

9.04 x 10 s 


Table 4.13: Locality Structure of Hybrid Fat-Trees (Single Unit Tree Stage)
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6.95 X 10 3 

1.89 X 10 s

Table 4.14: Locality Structure of Hybrid Fat-Trees (Two Unit Tree Stages)


Figures 4.12 through 4.17 show the routing statistics for the hybrid fat-trees described
in Section 4.4.2. The first three graphs are for the configurations with one stage of unit
trees, while the later three graphs are for the two tree stage configurations. The topmost
curve in each graph represents the routing probabilities within the bidelta cluster. Each
successive curve down a graph shows routing statistics for connecting through a successively
higher level in the tree structure.

Figure 4.18 shows the normalized routing probabilities for these six hybrid fat-tree
configurations. The normalized statistics for hybrid fat-trees with one or two unit tree
stages coincide as long as the size of their bidelta leaf clusters is identical; that is, the
normalized statistics differ only by the size of the bidelta leaf cluster. The topmost curve
shows statistics for hybrid fat-trees with B_{12} bidelta leaves, the middle for B_{48} bidelta
leaves, and the bottom for B_{192} leaves.

Figure 4.18: Normalized Routing Statistics for Hybrid Fat-Tree


4.5.3 Full Bidelta

To a processor on a full bidelta network, all destinations are equally distant. The
routing to all destinations is topologically identical. The bidelta network will provide its
best overall performance when message traffic is uniformly distributed across all processors;
that is, its most favorable loading condition occurs when each source will open a connection
to all destinations with equal likelihood. This flat, random distribution of connections is
essentially the extreme case in which no locality is exploited. As traffic deviates from
this flat distribution, the performance will deteriorate. Figure 4.19 shows the routing
statistics for the bidelta networks described in Section 4.4.3. The topmost curve in the
figure corresponds to a 3 stage, 64 processor, bidelta network. Each successive curve down
the graph plots statistics for a network with an additional stage of routing, making the
total network size a factor of four larger. The bottommost curve thus corresponds to a 10
stage network supporting 1,048,576 processors.


5. Conclusion


Chapters 2 and 3 presented the topology and construction details for constructing
fat-tree style networks using Transit technology. Chapter 4 then summarized many of the
parameters and characteristics implied by the network structure in order to provide a basis
for comparison with other architectures. This chapter summarizes some of the requirements
and consequences of these fat-tree structures and identifies components of the design which
might merit further study.

5.1 Routing Component Requirements

The manner in which the RN1 routing component could be utilized as the primitive
routing element in network construction was always a primary consideration throughout
this work.

In order to build a maximally fault tolerant network structure, two properties of the
RN1 component were identified as critical. The ability to separately configure the byte
dropping characteristics of each input port was established as necessary for optimal
dispersion of connections (Sections 3.1.2 and 3.3.5). The alternate configuration of RN1 as an
independent pair of 4x4 crossbars with one output in each logical direction proved critical
to the construction of a network resilient against single component failures. This also helps
minimize the effects of each component failure (Sections 2.1.2 and 3.3.4).

For this network structure the only functionality that is lacking from the current RN1
design is the ability to deal with round trip delays on long wires (Section 3.6). This ability
turns out to be necessary for optimal performance when building any networks of these sizes
regardless of whether the network is arranged as a bidelta or fat-tree network (Sections 3.6
and 4.2). Section 3.6 describes a scheme for rectifying this deficiency in future revisions of
RN1.

5.2 Characteristics 

The network design presented overcomes the potential problems identified in Section 1.1
due to network size, decomposition, and topology. A number of interesting structures were
developed to surmount these problems. The result is a fat-tree network structure that has
a number of desirable properties.

5.2.1 Constructable

An overriding concern in the development of the physical network structure was
establishing a design which could be physically realized in the real world. Attention was paid
to the real world constraints posed by technology limitations. Also, some care was taken
to minimize the construction complexity. The intent was to design a structure which could
realistically be fabricated within the next few years without relying on any technological
breakthroughs. At the same time, attention was paid to technology trends so that the
architecture itself would still prove valuable with improved technological capabilities.

The unit tree structures (Section 3.3) provide a powerful component for dealing with
technological size limitations. They provide a decomposition of the large network into
components that can be reliably fabricated. The unit tree also serves as a building block,
limiting design complexity. Since each unit tree is virtually identical, the complexity of
configuring a network is reduced to that of configuring a couple of unit trees and then
inter-connecting unit trees accordingly.

The hollow cube geometry (Section 3.4.3) provides a simple and straightforward manner
in which to arrange unit trees. Its structure reflects the natural growth characteristics of
the fat-trees constructed with unit trees. With careful design (Section 3.5), the hollow cube
structures can provide a framework for the interconnection of unit trees; in this manner,
they will significantly simplify the complexity of configuring and maintaining unit tree
interconnections.

5.2.2 Fault Tolerance

Utilizing the properties of RN1 and judicious wiring patterns, the fat-tree network will
be reasonably fault tolerant. All the fat-tree structures described are resilient against single
component failures (Sections 2.1.2 and 3.3.4). With the redundant paths through the
network, the effects of any component failures are minimized. Sections 2.1.1 and 2.3 presented
constraints on wiring to maximize the fault tolerance of these networks. Additionally, the
accessibility afforded by the hollow cube structure will allow faults to be repaired while the
network is in operation (Section 3.5.4).

5.2.3 Cheap Routing

The network is structured to make routing on the fat-tree both conceptually and
practically simple. This allows routing to be performed cheaply as described in Section 3.8.

5.2.4 Performance

The resulting fat-tree based networks attempt to minimize the latency of the
interconnect while maximizing the probability of performing a successful route.

Latency is improved over a naive approach in a number of ways. Routing on the upward
tree path (Section 2.2.4) reduces by a factor of three the number of stages of routing incurred
while traveling up the tree. Utilizing the RN1 component, which is a constant sized switch,
keeps the routing delay at each stage constant at the cost of incomplete concentration
(Section 2.2.4). Making the routing clock cycle insensitive to the signal delays incurred
while crossing long wires allows fast pipelining of data transmissions; a long wire only
affects the latency of a connection which actually traverses it (Section 3.6).

When building networks of the magnitude described here, it is clear that locality will
be necessary to obtain reasonable performance. The fat-trees constructed from unit trees
have a natural architectural locality resulting from the allocation of hardware. This
natural locality is summarized for full fat-trees in Section 4.5.1 and for hybrid fat-trees in
Section 4.5.2. Knowing this optimal level of locality may prove useful in designing parallel
algorithms and software for large systems. With proper locality, the routing statistics for
these fat-tree structures are reasonable; the probability of obtaining a successful route in a
fully loaded network is in the 70% to 80% range even for networks with three-quarters of a
million processors (Sections 4.5.1 and 4.5.2).

5.3 Future

I have attempted to develop the construction of these fat-tree structures in reasonable
detail. Sufficient detail was provided to demonstrate the feasibility of such networks. Also,
this development gives enough information about the structure that good estimates of
critical parameters such as hardware requirements and performance can be obtained. This
development, however, is by no means definitive. A number of issues are open for further
study and optimization while a few issues require additional specification.

5.3.1 Routing Statistic Modeling

The routing statistics provided in Section 4.5 model the probability of successfully
finding a route through the network as a function of network loading. As such, they do
not take into account the manner in which the network will normally be used; in general,
when a message fails to get routed, the processor will resend it later. This will have a
feedback effect on the network loading. In order to see network performance under this
more realistic model of network traffic, a more detailed statistical model must be utilized
which takes input queuing into account. Additionally, this analysis would provide a means
for estimating the amount of time required, on average, in order to acquire a complete
connection through the network.
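
To give a feel for the retry feedback described above, the following Python sketch estimates the expected number of attempts per connection. It is only a first-order illustration under a simplifying assumption that is not made in the report: each attempt succeeds independently with a fixed probability, so the attempt count is geometrically distributed. In the real network the success probability itself falls as retries raise the load, which is exactly the effect a more detailed model would need to capture. The function names are hypothetical.

```python
def expected_attempts(p_success: float) -> float:
    """Mean number of attempts until success, assuming independent attempts."""
    if not 0.0 < p_success <= 1.0:
        raise ValueError("p_success must be in (0, 1]")
    return 1.0 / p_success


def effective_load(offered_load: float, p_success: float) -> float:
    """Attempt rate seen by the network when every failed attempt is retried."""
    return offered_load * expected_attempts(p_success)


if __name__ == "__main__":
    for p in (0.7, 0.8):  # the success range quoted for fully loaded networks
        print(p, round(expected_attempts(p), 3))
```

With success probabilities in the 70% to 80% range quoted in Section 5.2.4, this simple model suggests between roughly 1.25 and 1.43 attempts per connection on average.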

5.3.2 Simulations 

As a complement to statistical modeling, it will be enlightening to simulate this
network structure. This will provide a good means of checking the validity of the statistical
assumptions under various loading conditions.

5.3.3 Interconnection Details

Section 2.3 provided a number of constraints necessary to obtain good performance.
As mentioned there, it is unclear whether or not the expansion properties proposed by
Leighton and Maggs should be used to further dictate the details of interconnection wiring.
Once we have functional simulations, we should be able to establish the importance of these
constraints. Once this is known, detailed wiring patterns must be developed for the unit
tree structures.

The wiring between stack stages merits additional attention and detail. If optical
interconnection is to actually be used, additional study on the integration of this technology will
be required. Additional work on schemes for adaptive alignment will be a virtual necessity
in order to utilize free-space interconnection on this scale. For maximal efficiency, custom
components may need to be designed and fabricated for this purpose.

5.3.4 Geometry

As stated in Section 3.4.6, the hollow cube geometry is not known to be optimal. It is
possible future study may produce a geometry that provides shorter interconnection distances
for the basic network structure described while retaining the properties of maintainability
and constructibility.

5.3.5 Packet Switching

The basic Transit networking scheme is circuit switched. Circuit switching is used to
minimize the latency inherent in getting a response from a remote processor on the network.
Using circuit switching, no buffering is needed within the network. This avoids problems
due to internal buffer overflow or network congestion by blocked packets.

For large networks where the latency from one end of the network to the other is
greater than the time required to transmit the standard quanta of data, circuit switching
may inefficiently utilize network bandwidth. If such is the case, it might be worthwhile to
consider packet routing schemes. The basic fat-tree structure and interconnect described
here would be applicable to a packet switched network scheme. The differences that arise
would occur in the routing protocol and policies. Almost all of these differences would thus
be limited to the design of the cache-controller and routing component.
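
The bandwidth concern can be made concrete with a deliberately simple model, sketched below in Python. It is an illustration only and is not taken from the report: it assumes a circuit is held for the connection latency plus the time to transmit one quantum of data, and that only the transmit time is useful work. The function name and the example numbers are hypothetical.

```python
def circuit_utilization(latency_ns: float, transmit_ns: float) -> float:
    """Fraction of the circuit hold time spent moving data (assumed model)."""
    if latency_ns < 0 or transmit_ns <= 0:
        raise ValueError("latency must be >= 0 and transmit time must be > 0")
    return transmit_ns / (latency_ns + transmit_ns)


if __name__ == "__main__":
    # Example values only: 300 ns of latency against a 100 ns data quantum.
    print(circuit_utilization(300.0, 100.0))  # 0.25
```

Under this model, utilization drops below one half as soon as the latency exceeds the transmit time, which is the regime in which the text suggests considering packet routing.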

5.3.6 Construct Prototypes

Certainly, the most definitive way to evaluate the worth of this fat-tree network
structure is to actually construct prototypes. The construction exercise will guarantee that no
essential construction details are overlooked. Such construction will, no doubt, uncover
issues and problems not yet considered.